DOES THE HALF ADVERSARIAL ROBUSTNESS REPRESENT THE WHOLE? IT DEPENDS... A THEORETICAL PERSPECTIVE OF SUBNETWORK ROBUSTNESS

Abstract

Adversarial robustness of deep neural networks has been studied extensively and can bring security against adversarial attacks/examples. However, adversarially robust training approaches require a training mechanism on the entire deep network, which can come at the cost of efficiency and computational complexity such as runtime. As a pilot study, we develop in this paper a novel theoretical framework that aims to answer the question: how can we make a whole model robust to adversarial examples by making only part of the model robust? Toward promoting subnetwork robustness, we propose for the first time the new concept of semirobustness, which denotes adversarial robustness of a part of the network. We provide a theoretical analysis showing that if a subnetwork is robust and highly dependent on the rest of the network, then the remaining layers are also guaranteed to be robust. To empirically validate our theoretical findings, we implemented our method at multiple layer depths and across multiple common image classification datasets. Experiments demonstrate that, given sufficient dependency between subnetworks, our method successfully utilizes subnetwork robustness to match fully-robust models' performance across AlexNet, VGG16, and ResNet50 benchmarks, under the attacks FGSM, I-FGSM, PGD, C&W, and AutoAttack.

1. INTRODUCTION

Deep neural networks (DNNs) have been highly successful in computer vision, particularly in image classification, as well as in speech recognition and natural language processing, where they can often outperform human abilities Mnih et al. (2015); Radford et al. (2015); Goodfellow et al. (2016). Despite this, the reliability of deep learning algorithms is fundamentally challenged by the existence of "adversarial examples": natural images perturbed with carefully crafted, typically imperceptible noise such that networks misclassify them. In the context of image classification, an extremely small perturbation can change the label of a correctly classified image Szegedy et al. (2014); Goodfellow et al. (2014). For this reason, adversarial examples present a major threat to the security of deep-learning systems; a robust classifier, however, can correctly label adversarially perturbed images. For example, an adversary could alter images of the road to fool a self-driving car's neural network into misclassifying traffic signs Papernot et al. (2016a), reducing the car's safety, but a robust network would detect and reject the adversarial inputs Ma et al. (2018); Biggio et al. (2013). The problem of finding perturbed inputs, known as adversarial attacks, has been studied extensively Kurakin et al. (2017); Sharif et al. (2016); Brown et al. (2017); Eykholt et al. (2018). To handle adversarial attacks, two major directions have been studied: (1) efficient methods to find adversarial examples Su et al. (2019); Laidlaw & Feizi (2019); Athalye et al. (2018); Liu et al. (2016); Xie et al. (2017); Akhtar & Mian (2018), and (2) adversarial training to make deep neural networks more robust against adversarial attacks Madry et al. (2018); Tsipras et al. (2019); Gilmer et al. (2019); Ilyas et al. (2019); Papernot et al. (2016b). The adversarial perturbations may be applied to the input or to the network's hidden layers Goodfellow et al. (2014); Szegedy et al. (2014), and it has been shown that this strategy is effective at improving a network's robustness Goodfellow et al. (2014). Several theories have been developed to explain the phenomenon of adversarial examples Raghunathan et al. (2018); Xiao et al. (2019); Cohen et al. (2019); Shamir et al. (2019); Fawzi et al. (2016); Carlini & Wagner (2017); Weng et al. (2018); Ma et al. (2018). Previously, Ilyas et al. (2019) investigated adversarial robustness from a theoretical perspective, addressing "useful, non-robust features": useful because they help a network improve its accuracy, and non-robust because they are imperceptible to humans and thus not intended to be used for classification. Normally, a model considers robust features to be about as important as non-robust ones, yet adversarial examples encourage it to rely on only non-robust features. Ilyas et al. (2019) introduce a framework to explain the phenomenon of adversarial vulnerability: a feature f is a "ρ-useful feature" if it is correlated with the true label in the dataset; similarly, "γ-robustly useful features" remain ρ-useful under a set of adversarial perturbations. While Ilyas et al. (2019) constitutes a fundamental advance in the theoretical understanding of adversarial examples, and opens the way to a thorough characterization of the relation between network architecture and robustness to adversarial perturbations, little attention has been paid to how robustness throughout the network is guaranteed and whether adversarial training must be applied to the entire network. In this paper, we develop a new theoretical framework that monitors the robustness across the layers of a DNN and shows that if the early layers are adversarially trained and sufficiently connected with the rest of the network, then adversarial robustness of the later layers is obtained; by connectivity we mean that the early layers are highly dependent on the later layers.
All of these findings raise a fundamental question: how can we make a whole model robust to adversarial inputs by making only part of the model robust? In addition, the vulnerability to adversarial perturbations of models trained with standard methods makes it clear that the paradigm of adversarially robust learning differs from the classic learning setting. In particular, we already know that robustness comes at the cost of computationally expensive training methods (more training time) Zhang et al. (2019), as well as a potential need for more training data and memory capacity Schmidt et al. (2018). Hence, one notable challenge in adversarially robust learning is reducing computational complexity while maintaining the desired performance. To this end, by exploiting the possibility that subnetworks can be robust to adversarial attacks, we propose a novel approach that theoretically analyzes adversarial robustness guarantees obtained by adversarially training only a subset of layers. This work also pioneers the new concept of "semirobustness", which denotes adversarial robustness of a part of the network. It includes a new perspective on adversarial perturbations and a novel theoretical framework that establishes the following claim: if a subnetwork is robust, is highly dependent on the rest of the network, and passes sufficient connectivity toward the last layer, then the remaining layers are also guaranteed to be robust.

Contributions

To summarize, our contributions in this paper are: (1) We introduce the novel concept of semirobustness in subnetworks and show that a subnetwork is semirobust if and only if all layers within it are semirobust. (2) For the first time, we provide a theoretical framework and prove that, under some assumptions, if the first part of the network is semirobust then the robustness of the second part is guaranteed. (3) Experimentally, we demonstrate that, given sufficient mutual dependency between subnetworks, our method attains the same adversarial robustness as regular adversarial training of the whole network.

2. SUBNETWORK ROBUSTNESS

Notations: We assume a given DNN has a total of $n$ layers, where $F^{(n)}: \mathcal{X} \to \mathcal{Y}$ is a function mapping the input space $\mathcal{X}$ to a set of classes $\mathcal{Y}$. Let $f^{(l)}$ be the $l$-th layer of $F^{(n)}$, given by $f^{(l)}(x_{l-1}) = \sigma^{(l)}(w^{(l)} x_{l-1} + b^{(l)})$, where $\sigma^{(l)}$ is the activation function in layer $l$. Denote by $\pi(y)$ the prior probability of class label $y \in \mathcal{Y}$. A subnetwork $F^{(i,j)} := f^{(j)} \circ \ldots \circ f^{(i)}$ is a group of consecutive layers $f^{(i)}, \ldots, f^{(j)}$, and $F^{(j)} := F^{(1,j)} = f^{(j)} \circ \ldots \circ f^{(1)}$ is the first part of the network up to layer $j$. In this section, we define the notion of a Semirobust Subnetwork; we discuss semirobustness further in Section 2.1.

Definition 1 (Semirobust Subnetwork) Suppose input $X$ and label $y$ are sampled from joint distribution $\mathcal{D}$. For a given distribution $\mathcal{D}$, a subnetwork $F^{(j)}$ is called $\gamma_j$-semirobust if there exists a mapping function $G_j: L_j \to \mathcal{Y}$ such that
$$\mathbb{E}_{(X,y)\sim\mathcal{D}}\left[\inf_{\delta \in S_x} y \cdot G_j \circ F^{(j)}(X+\delta)\right] \ge \gamma_j, \qquad (1)$$
for an appropriately defined set of perturbations $S_x$. In (1), $G_j$ is a non-unique function mapping layer $f^{(j)}$ to the class set $\mathcal{Y}$, and $\gamma_j$ is a constant denoting the correlation between $y$ and $F^{(j)}$. Note that $G_j$ is necessary if the dimensionality of $F^{(j)}$ does not match that of $y$; if $F^{(j)} = F^{(n)}$, the semirobust definition reduces to standard $\gamma$-robustness as defined in Ilyas et al. (2019). To define semirobustness for a single layer $f^{(j)}$, in (1) we simply replace $F^{(j)}$ with $f^{(j)}$ and $X+\delta$ with $K_{j-1}(X+\delta)$, where $K_{j-1}: \mathcal{X} \to L_{j-1}$ is a mapping function. To avoid confusion, throughout this paper we also write $X+\delta$ for the input in layer semirobustness. Throughout this paper, we assume the network $F^{(n)}$ is a useful network, i.e., for a given distribution $\mathcal{D}$, the correlation between $F^{(n)}$ and the true label $y$, $\mathbb{E}_{(X,y)\sim\mathcal{D}}\left[y \cdot F^{(n)}(X)\right]$, is highest in expectation at optimal performance.
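As a concrete toy illustration of Definition 1, the following sketch Monte-Carlo-estimates the semirobustness constant $\gamma_j$ for a hypothetical two-unit "subnetwork". The functions `F_j` and `G_j`, the data, and the corner-grid approximation of the inner infimum are all illustrative assumptions, not the paper's models.

```python
import numpy as np

# Toy Monte-Carlo estimate of the semirobustness constant gamma_j from
# Definition 1.  y in {-1,+1}, F_j is a hypothetical one-ReLU-layer
# "subnetwork", G_j a scalar readout; the inner inf over delta is crudely
# approximated on corners of the L_inf ball (plus delta = 0).

rng = np.random.default_rng(0)

def F_j(x):                      # hypothetical subnetwork: one ReLU layer
    W = np.array([[1.0, -0.5], [0.3, 1.2]])
    return np.maximum(W @ x, 0.0)

def G_j(h):                      # hypothetical readout L_j -> [-1, 1] scores
    return np.tanh(h[0] - h[1])

def semirobustness(xs, ys, eps, n_dirs=64):
    """Average over samples of inf_delta y * G_j(F_j(x + delta))."""
    dirs = eps * np.sign(rng.uniform(-1.0, 1.0, size=(n_dirs, 2)))
    dirs = np.vstack([dirs, np.zeros(2)])   # include the clean point
    vals = []
    for x, y in zip(xs, ys):
        worst = min(y * G_j(F_j(x + d)) for d in dirs)
        vals.append(worst)
    return float(np.mean(vals))

xs = rng.normal(size=(200, 2))
ys = np.sign(xs[:, 0] - xs[:, 1] + 1e-9)
gamma_clean = semirobustness(xs, ys, eps=0.0)
gamma_robust = semirobustness(xs, ys, eps=0.3)
# A larger perturbation budget can only shrink the inner infimum:
assert gamma_robust <= gamma_clean + 1e-12
```

Since the perturbation set at $\epsilon = 0.3$ contains $\delta = 0$, the estimated $\gamma_j$ under attack can never exceed the clean value, mirroring how robustness constrains the correlation in (1).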
Intuitively, a highly useful network $F^{(n)}$ minimizes the classification loss
$$\mathbb{E}_{(X,y)\sim\mathcal{D}}\left[\mathcal{L}(X,y)\right] = -\mathbb{E}_{(X,y)\sim\mathcal{D}}\Big[y \cdot \Big(b + \textstyle\sum_{F^{(n)} \in \mathcal{F}^{(n)}} w_{F^{(n)}} F^{(n)}(X)\Big)\Big], \qquad (2)$$
where $w_{F^{(n)}}$ is the weight vector and $\mathcal{F}^{(n)}$ is the set of $n$-layer networks. Definition 1 raises valid questions regarding the relationship between a subnetwork and the robustness of its associated layers. We show this relationship in the following theorem.

Theorem 1 The subnetwork $F^{(j)}$ is $\gamma_j$-semirobust if and only if every layer of $F^{(j)}$, i.e., $f^{(j)}, f^{(j-1)}, \ldots, f^{(1)}$, is also semirobust with bound parameters $\gamma_j, \ldots, \gamma_1$, respectively.

Theorem 1 is a key result supporting our main claims on the relationship between layer-wise and subnetwork robustness; its proof is provided in the supplementary materials (SM). Next, we show that under a strong dependency assumption between layers, the robustness of subnetworks is guaranteed.

2.1. SEMIROBUSTNESS GUARANTEES

In this section, we provide a theoretical analysis to explain how dependency between layers of subnetworks promotes semirobustness and eliminates the requirement of entire-network adversarial training.

Non-linear Probabilistic Dependency (Mutual Information): Among the various probabilistic dependency measures, we adopt in this paper an information-theoretic measure called mutual information (MI): a measure of the reduction in uncertainty about one random variable from knowing another. Formally, let $\mathcal{X}$ and $\mathcal{Z}$ be Euclidean spaces, and let $P_{XZ}$ be a probability measure on the space $\mathcal{X} \times \mathcal{Z}$, with $P_X$ and $P_Z$ the marginal probability measures. The mutual information, denoted $I(X;Z)$, is defined as
$$I(X;Z) = \mathbb{E}_{P_X P_Z}\left[g\!\left(\frac{dP_{XZ}}{dP_X P_Z}\right)\right], \qquad (3)$$
where $\frac{dP_{XZ}}{dP_X P_Z}$ is the Radon-Nikodym derivative and $g: (0,\infty) \to \mathbb{R}$ is a convex function with $g(1)=0$. Note that when $\frac{dP_{XZ}}{dP_X P_Z} \to 1$, then $I \to 0$. Using (3), the MI between two layers $f^{(i)}$ and $f^{(j)}$ with joint distribution $P_{ij}$ and marginal distributions $P_i, P_j$, respectively, is given as
$$I(f^{(i)}; f^{(j)}) = \mathbb{E}_{P_i P_j}\left[g\!\left(\frac{dP_{ij}}{dP_i P_j}\right)\right]. \qquad (4)$$
The concept of MI is integral to the central theorem of our framework through the assumptions below.

Assumptions: Let $G_a: L_a \to \mathcal{Y}$ be a function mapping layer $f^{(a)}$ to a label $y \in \mathcal{Y}$, and let $G_j: L_j \to \mathcal{Y}$ be a function mapping layer $f^{(j)}$ to a label $y \in \mathcal{Y}$. Let $g_\delta = f^{(a)}(X+\delta)$ and $h_{\delta,j} = f^{(j)}(X+\delta)$ for $\delta \in S_x$ (the perturbation set). Note that $g_\delta = h_{\delta,a}$.

A1: The class-conditional MI between $h_{\delta,j-1}$ and $h_{\delta,j}$ is at least a hyperparameter $\rho_j \ge 0$, i.e.,
$$\sum_y \pi(y)\, I(h_{\delta,j-1}; h_{\delta,j} \mid y) \ge \rho_j. \qquad (5)$$

A2: There exists a constant $U_j \ge 0$ such that for all $\delta \in S_x$:
$$\mathbb{E}_{p(h_{\delta,j-1}, h_{\delta,j}, y)}\left[\frac{p(h_{\delta,j-1}, h_{\delta,j} \mid y)}{p(h_{\delta,j-1}\mid y)\, p(h_{\delta,j}\mid y)}\right] \le U_j, \qquad \mathbb{E}_{p(h_{\delta,j-1}, h_{\delta,j}, y)}\left[y \cdot \left(G_j \circ h_{\delta,j} - G_{j-1} \circ h_{\delta,j-1}\right)\right] \ge 1 + U_j,$$
where $p(h_{\delta,j-1}, h_{\delta,j}, y)$ is the joint probability of the random triple $(h_{\delta,j-1}, h_{\delta,j}, y)$.
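Assumption A1 can be checked empirically. Below is a minimal plug-in (histogram) estimate of the class-conditional MI $\sum_y \pi(y) I(h_{\delta,j-1}; h_{\delta,j} \mid y)$ for 1-D layer summaries; this is a crude stand-in for the EDGE estimator used in our experiments, run here on synthetic "activations".

```python
import numpy as np

# Plug-in (histogram) estimate of the class-conditional MI in assumption A1,
# sum_y pi(y) * I(h_{j-1}; h_j | y), for scalar layer summaries.  Synthetic
# data only; this is an illustrative stand-in for the EDGE estimator.

def mi_hist(u, v, bins=8):
    """Plug-in mutual information (nats) of two 1-D samples via a joint histogram."""
    pxy, _, _ = np.histogram2d(u, v, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def class_conditional_mi(h_prev, h_next, y, bins=8):
    total, n = 0.0, len(y)
    for cls in np.unique(y):
        m = (y == cls)
        total += (m.sum() / n) * mi_hist(h_prev[m], h_next[m], bins)  # pi(y) * I(.|y)
    return total

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=4000)
h_prev = rng.normal(size=4000) + y               # toy layer-(j-1) activations
h_dep = h_prev + 0.1 * rng.normal(size=4000)     # strongly dependent next layer
h_ind = rng.normal(size=4000) + y                # independent given the class

rho_dep = class_conditional_mi(h_prev, h_dep, y)
rho_ind = class_conditional_mi(h_prev, h_ind, y)
assert rho_dep > rho_ind   # dependency raises the attainable A1 lower bound rho_j
```

A strongly dependent layer pair admits a much larger $\rho_j$ in (5) than a conditionally independent one, which is exactly the regime Theorem 2 requires.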
Theorem 2 Let $f_a$ be a $\gamma_a$-semirobust subnetwork equal to $F^{(a)}$, and let $f_b$ be the subnetwork $F^{(a+1,n)}$. If assumptions A1 and A2 hold for $j = a+1, \ldots, n$, then $f_b$ is $\gamma_b$-semirobust.

In Theorem 2, $\gamma_b \le \gamma_a + \sum_{j=a+1}^{n} \rho_j$. Note that the constant $U_j$ does not depend on $\gamma_a$, $\gamma_b$, or $\rho_j$. This theorem is an extension of the following lemma, and the proofs of both are found in the SM.

Lemma 1 Let $F^{(n-1)}$ be a $\gamma_{n-1}$-semirobust subnetwork. Let $g_\delta = f^{(n-1)}(X+\delta)$ and $h_\delta = f^{(n)}(X+\delta)$ for $\delta \in S_x$. Let $G_{n-1}: L_{n-1} \to \mathcal{Y}$ be a function mapping layer $g_\delta$ to the network's output $y \in \mathcal{Y}$. Under the following assumptions, $f^{(n)}$ is $\gamma_n$-semirobust:
• B1: The MI between $f^{(n-1)}$ and $f^{(n)}$ is at least a hyperparameter $\rho \ge 0$, i.e., $\sum_y \pi(y) I(g_\delta; h_\delta \mid y) \ge \rho$.
• B2: There exists a constant $U \ge 0$ such that for all $\delta \in S_x$: $\mathbb{E}_{p(g_\delta,h_\delta,y)}\left[\frac{p(g_\delta,h_\delta\mid y)}{p(g_\delta\mid y)\,p(h_\delta\mid y)}\right] \le U$ and $\mathbb{E}_{p(g_\delta,h_\delta,y)}\left[y \cdot (h_\delta - G_{n-1}\circ g_\delta)\right] \ge 1 + U$.

Note that in Lemma 1, $\gamma_n \le \gamma_{n-1} + \rho$, and assumptions B1 and B2 are particular cases of A1 and A2 when $a = n-1$.

Intuition: Let $\mathrm{IF}(\cdot)$ denote the information flow passing through layers in the network $F^{(n)}$. Intuitions from the information-flow literature suggest that, in a feed-forward network, if the learning information is preserved up to a given layer, one can utilize knowledge of this information flow in the next layer's learning process due to the composition principle $F^{(i,j)} = f^{(j)} \circ F^{(i,j-1)}$, and consequently $\mathrm{IF}^{(i,j)} \approx \mathrm{IF}^{(j)} \circ \mathrm{IF}^{(i,j-1)}$. This is desirable because, in practice, training only a subnetwork requires less computation and memory. In other words, under the assumption of a strong connection between the $j$-th and $(j-1)$-th layers, the information automatically passes through the later layers, and subnetwork training yields sufficient solutions for the decision-making task.
To better characterize the measure of information flow, we employ a non-linear, probabilistic dependency measure that quantifies the mutual relationship between layers, i.e., how much one layer tells us about another. An important takeaway from Theorem 2 (and Lemma 1) is that strong non-linear mutual connectivity between subnetworks guarantees that securing only the robustness of the first subnetwork ensures information flow throughout the entire network.

Linear Connectivity: To show that our theoretical study in Theorem 2 also holds under a linear connectivity assumption between subnetworks, we provide a theorem for the scenario in which the layers in the second part of the network are linear combinations of the preceding layers.

Theorem 3 Let $f_a$ be a $\gamma_a$-semirobust subnetwork equal to $F^{(a)}$, and let $f_b$ be the subnetwork $F^{(a+1,n)}$. If for $j = a+1, \ldots, n$, $f^{(j)} = \sum_{i=1}^{j-1} \lambda_{ij}^T f^{(i)}$, where $\lambda_{ij}$ is a map $L_i \to L_j$ represented by a matrix of dimensionality $L_i \times L_j$, then $f_b$ is $\gamma_b$-semirobust with $\gamma_b = \gamma_a \frac{(n-1-a)(n-a)}{2}$.

This theorem shows that when the connectivity between the layers in $f_a$ and $f_b$ is linear, we achieve the semirobustness property for the subnetwork $f_b$. Importantly, note that the linear-combination multipliers determine the correlation between layers given constant layer variances: if $f^{(j)} = \lambda_{ij} f^{(i)}$, then $\mathrm{Cov}(f^{(j)}, f^{(i)}) = \lambda_{ij} \mathrm{Var}(f^{(i)})$. Theorem 3 is an extension of Lemma 2; detailed proofs and accompanying experiments are provided in the SM.

Lemma 2 Let the last layer $f^{(n)}$ be a linear combination of $f^{(n-1)}, \ldots, f^{(1)}$, expressed as $f^{(n)} = \sum_{i=1}^{n-1} \lambda_i^T f^{(i)}$, where $\lambda_i$ is a map $L_i \to L_n$ represented by a matrix of dimensionality $L_i \times L_n$. If $F^{(n-1)}$ is $\gamma$-semirobust, then $f^{(n)}$ is $\gamma_n$-semirobust with $\gamma_n = \sum_{i=1}^{n-1} \gamma_i$.
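The covariance identity invoked after Theorem 3 can be verified numerically in the scalar case: if $f^{(j)} = \lambda_{ij} f^{(i)}$, then $\mathrm{Cov}(f^{(j)}, f^{(i)}) = \lambda_{ij}\,\mathrm{Var}(f^{(i)})$. Toy data, illustrative only:

```python
import numpy as np

# Numeric sanity check of the scalar covariance identity used after
# Theorem 3: for f_j = lambda * f_i, the sample covariance equals
# lambda * Var(f_i) exactly (up to float error), by bilinearity of Cov.

rng = np.random.default_rng(2)
f_i = rng.normal(size=100_000)   # activations of layer i (toy)
lam = 0.7                        # linear-connectivity multiplier
f_j = lam * f_i                  # layer j as a linear function of layer i

cov = np.cov(f_j, f_i)[0, 1]                       # sample covariance (ddof=1)
assert abs(cov - lam * np.var(f_i, ddof=1)) < 1e-9
```

Both `np.cov` and `np.var(..., ddof=1)` use the unbiased normalization, so the identity holds exactly in-sample, not just in expectation.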
Question: At this point, a valid question is how the performance of a network differs under optimal full-network robustness $(f_a^*, f_b^*)$ and subnetwork robustness $(f_a^*, \tilde{f}_b)$. Does the performance difference have any relationship to the weight difference between the subnetworks $f_b^*$ and $\tilde{f}_b$? We investigate this question in the next section by analyzing the difference between the loss functions of the networks $(f_a^*, f_b^*)$ and $(f_a^*, \tilde{f}_b)$.

2.2. FURTHER THEORETICAL INSIGHTS

Let $\omega^*$ be the convergent parameters after training of the network $F^{*(n)} := (f_a^*, f_b^*)$, which is adversarially robust against a given attack. Let $\tilde{\omega}^*$ be the convergent parameters of the network $(f_a^*, \tilde{f}_b)$, which is adversarially semirobust against the attack; this means only the first part of the network is robust. Let $\omega_b^*$, $\tilde{\omega}_b$, and $\omega_a^*$ be the weights of $f_b^*$, $\tilde{f}_b$, and $f_a^*$, respectively. Recall the loss function (2), and remove the offset $b$ without loss of generality. Define
$$\ell(\omega) := -\sum_{F \in \mathcal{F}} w_F \cdot F^{(n)}(X), \qquad (6)$$
so the loss function in (2) becomes $\mathbb{E}_{(X,Y)\sim\mathcal{D}}\{\mathcal{L}(F^{(n)}(X), Y)\} = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\{Y \cdot \ell(\omega)\}$ and $\omega^* := \arg\min_\omega \mathbb{E}_{(X,Y)\sim\mathcal{D}}\{Y \cdot \ell(\omega)\}$, where $\ell$ is defined in (6).

Definition 2 (Performance Difference) Suppose input $X$ and task $Y$ have joint distribution $\mathcal{D}$. Let $\tilde{F}^{(n)} := (f_a^*, \tilde{f}_b) \in \mathcal{F}$ be the $n$-layer network in which only the subnetwork $f_a^*$ is robust. The performance difference between the robust $F^{*(n)} := (f_a^*, f_b^*)$ and the semirobust $\tilde{F}^{(n)}$ is defined as
$$d(F^{*(n)}, \tilde{F}^{(n)}) := \mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[\mathcal{L}(F^{*(n)}(X), Y) - \mathcal{L}(\tilde{F}^{(n)}(X), Y)\right]. \qquad (7)$$

Let $\delta(\omega^* \mid \tilde{\omega}^*) := \ell(\omega^*) - \ell(\tilde{\omega}^*)$. The performance difference (7) is the average of $\delta$:
$$d(F^{*(n)}, \tilde{F}^{(n)}) = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[Y \cdot \delta(\omega^* \mid \tilde{\omega}^*)\right] = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[Y \cdot \left(\ell(\omega^*) - \ell(\tilde{\omega}^*)\right)\right]. \qquad (8)$$

Using a Taylor approximation of $\ell$ around $\omega^*$:
$$\ell(\tilde{\omega}^*) \approx \ell(\omega^*) + (\tilde{\omega}^* - \omega^*)^T \nabla \ell(\omega^*) + \tfrac{1}{2}(\tilde{\omega}^* - \omega^*)^T \nabla^2 \ell(\omega^*)(\tilde{\omega}^* - \omega^*), \qquad (9)$$
where $\nabla \ell(\omega^*)$ and $\nabla^2 \ell(\omega^*)$ are the gradient and Hessian of $\ell$ at $\omega^*$. Since $\omega^*$ is the convergent point of $(f_a^*, f_b^*)$, we have $\nabla \ell(\omega^*) = 0$, which implies
$$\ell(\tilde{\omega}^*) - \ell(\omega^*) \approx \tfrac{1}{2}(\tilde{\omega}^* - \omega^*)^T \nabla^2 \ell(\omega^*)(\tilde{\omega}^* - \omega^*) \le \tfrac{1}{2} \lambda_{\max} \|\tilde{\omega}^* - \omega^*\|^2, \qquad (10)$$
where $\lambda_{\max}$ is the maximum eigenvalue of $\nabla^2 \ell(\omega^*)$. In (10), $\|\tilde{\omega}^* - \omega^*\|^2 = \|\tilde{\omega}_b - \omega_b^*\|^2$ holds because $\tilde{\omega}^* = (\omega_a^*, \tilde{\omega}_b)$ and $\omega^* = (\omega_a^*, \omega_b^*)$. Note that the weight matrices $\omega^*$ and $\tilde{\omega}^*$ are reshaped here.
Using the loss function $\mathbb{E}_{(X,Y)\sim\mathcal{D}}\{Y \cdot \ell(\omega)\}$, we have
$$\mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[Y \cdot \left(\ell(\tilde{\omega}^*) - \ell(\omega^*)\right)\right] \le \tfrac{1}{2}\, \mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[Y \cdot \lambda_{\max} \|\tilde{\omega}_b - \omega_b^*\|^2\right]. \qquad (11)$$
This shows that the performance difference (8) between the networks $F^{*(n)}$ and $\tilde{F}^{(n)}$ is upper bounded by the $L_2$ norm of the weight difference between $f_b^*$ and $\tilde{f}_b$, i.e., $\tilde{\omega}_b - \omega_b^*$. Alternatively, using the Cauchy-Schwarz inequality, we have
$$\mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[Y \cdot \left(\ell(\tilde{\omega}^*) - \ell(\omega^*)\right)\right] \le \mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[Y\, \|f^{(n)}(x; \tilde{\omega}^*) - f^{(n)}(x; \omega^*)\|_2\right], \qquad (12)$$
where $f^{(n)}$ is the last layer of the network. Recall equation (8) from Lee et al. (2021). Since $\omega^*$ and $\tilde{\omega}^*$ are the weights of the networks $(f_a^*, f_b^*)$ and $(f_a^*, \tilde{f}_b)$, we have
$$\|f^{(n)}(x; \tilde{\omega}^*) - f^{(n)}(x; \omega^*)\|_2 \le \|\tilde{\omega}_b - \omega_b^*\|_F\, \|\sigma(f_a(x, \omega_a^*))\|_2. \qquad (13)$$
Next, we assume the activation function $\sigma$ is Lipschitz continuous, i.e., for any $u$ and $v$ there exists a constant $C_\sigma$ such that $|\sigma(u) - \sigma(v)| \le C_\sigma |u - v|$, and that it satisfies $\sigma(0) = 0$. Further assuming $\|x\|_2$ is bounded by $C_x$ and using a peeling procedure, we get
$$\|f^{(n)}(x; \tilde{\omega}^*) - f^{(n)}(x; \omega^*)\|_2 \le C_{x,\sigma}\, \|\tilde{\omega}_b - \omega_b^*\|_F \prod_{j \in a} \|\omega^{*(j)}\|_F, \qquad (14)$$
where $\omega^{*(j)}$ is the weight matrix of the $j$-th layer in $f_a^*$ and $C_{x,\sigma} = C_x C_\sigma$. Combining (14) and (12), we obtain the upper bound
$$\mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[Y \cdot \left(\ell(\tilde{\omega}^*) - \ell(\omega^*)\right)\right] \le \mathbb{E}_{(X,Y)\sim\mathcal{D}}\left[Y \cdot C_{x,\sigma}(\omega_a^*)\, \|\tilde{\omega}_b - \omega_b^*\|_F\right], \qquad (15)$$
where $C_{x,\sigma}(\omega_a^*) = C_{x,\sigma} \prod_{j \in a} \|\omega^{*(j)}\|_F$. This alternative approach validates the result in (11) and supports the conclusion that the performance difference between robust and semirobust networks is closely tied to their weight difference. In this section we proved two bounds for the performance difference defined in (8).
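The second-order bound (10) can be sanity-checked on an exactly quadratic loss, where the gap $\ell(\tilde{\omega}^*) - \ell(\omega^*)$ equals $\frac{1}{2} d^T H d$ with $d = \tilde{\omega}^* - \omega^*$ and is bounded by $\frac{1}{2}\lambda_{\max}\|d\|^2$. The Hessian and weights below are arbitrary illustrative values:

```python
import numpy as np

# Numeric check of bound (10): for a loss that is exactly quadratic around
# its minimizer w*, l(w~) - l(w*) = 0.5 * d^T H d with d = w~ - w*, which
# is bounded by 0.5 * lambda_max(H) * ||d||^2 (Rayleigh quotient).

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
H = A @ A.T                             # PSD "Hessian" at the optimum
w_star = rng.normal(size=5)             # fully robust weights (toy)
w_tilde = w_star + rng.normal(size=5)   # semirobust weights, differing in f_b

d = w_tilde - w_star
gap = 0.5 * d @ H @ d                   # l(w~) - l(w*) for the quadratic model
lam_max = np.linalg.eigvalsh(H).max()
assert 0.0 <= gap <= 0.5 * lam_max * (d @ d) + 1e-9
```

The gap is nonnegative because $H$ is positive semidefinite, matching the intuition that the semirobust network cannot beat the fully robust minimizer under this local model.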

3. EXPERIMENTS AND ANALYSES

To confirm our theoretical findings and experimentally validate Theorems 1-3, we test our method at multiple layer depths across multiple common image classification networks trained on the CIFAR-10 Krizhevsky et al. (2009), CIFAR-100 Krizhevsky et al. (2009), and Imagenette Howard; Deng et al. (2009) datasets.

3.1. EXPERIMENTAL SETUP

To guide the empirical investigation of our theoretical findings, we consider the attack models, MI estimator, and adversarial training settings as follows.

Attack Models: The most common threat model used when generating adversarial examples is the additive threat model. Let $X = (X_1, \ldots, X_d)$, where each $X_i$ is a feature of $X$. In an additive threat model, the adversarial example is $X_\delta = (X_1 + \delta_1, \ldots, X_d + \delta_d)$, i.e., $X_\delta = X + \delta$, where $\delta = (\delta_1, \ldots, \delta_d)$. Under this attack model, perceptual similarity is usually enforced by a bound on the norm of $\delta$, $\|\delta\| \le \epsilon$; a small $\epsilon$ is usually necessary because otherwise the noise on the input could be visible. We use some of the most common additive attack models: the Fast Gradient Sign Method (FGSM) Goodfellow et al. (2014); Szegedy et al. (2014), iterative FGSM (I-FGSM) Kurakin et al. (2017), Projected Gradient Descent (PGD) Madry et al. (2018), Carlini & Wagner (C&W) Carlini & Wagner (2017), and AutoAttack Croce & Hein (2020). We use $\epsilon = 8/255$, $16/255$, and $32/255$. For iterative attacks, we use an $\epsilon$ step of $1/255$ and a number of iterations equal to $\min(\epsilon + 4, 1.25\epsilon)$ with $\epsilon$ in pixel units, i.e., 10, 20, and 36 iterations for the respective $\epsilon$ values, as suggested by Kurakin et al. (2018). Attacks use the $L_\infty$ norm, with the exception of C&W, which uses the $L_2$ norm. Additional details can be found in the SM.
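For concreteness, a single FGSM step under the additive threat model above can be sketched as follows. The "network" here is a tiny logistic model with a closed-form input gradient; the weights, input, and $\epsilon$ are illustrative, not our experimental setup.

```python
import numpy as np

# Minimal FGSM sketch under the additive threat model:
# x_delta = x + eps * sign(grad_x L), with ||x_delta - x||_inf <= eps.
# The classifier is a toy logistic model, so grad_x is available in closed form.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One-step L_inf attack on binary cross-entropy, y in {0, 1}."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w              # d BCE / d x for the logistic model
    return x + eps * np.sign(grad_x)  # ascend the loss, stay in the eps-ball

w = np.array([1.0, -2.0, 0.5])        # illustrative weights
b = 0.1
x = np.array([0.2, -0.1, 0.4])        # illustrative clean input
y = 1.0
eps = 8 / 255

x_adv = fgsm(x, y, w, b, eps)
bce = lambda xx: -(y * np.log(sigmoid(w @ xx + b)))
assert np.all(np.abs(x_adv - x) <= eps + 1e-12)  # perturbation is bounded
assert bce(x_adv) >= bce(x)                      # the ascent step raises the loss
```

Because the model is linear in $x$, one sign step provably increases the loss here; for deep networks the same step is only a first-order heuristic, which is why iterative variants (I-FGSM, PGD) are used.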

MI Estimation:

We use a reduced-complexity MI estimator called the ensemble dependency graph estimator (EDGE) Noshad et al. (2019). The estimator combines randomized locality-sensitive hashing (LSH), dependency graphs, and ensemble bias-reduction methods. We chose EDGE because it has been shown to achieve optimal computational complexity O(n), where n is the sample size; it is thus significantly faster than its plug-in competitors Kraskov et al. (2004); Moon et al. (2017); Noshad et al. (2017). In addition to fast execution, EDGE has an optimal parametric MSE rate of O(1/n) under a specific condition.

Adversarial Training: Adversarial training is an approach to making models more robust to adversarial attacks by producing adversarial examples and inserting them into the training data. Given adversarial examples in the original input, we focus on the min-max formulation of adversarial training, which applies standard training to a classifier by minimizing a loss function that decreases with the correlation between the weighted combination of the features and the label Goodfellow et al. (2015); Madry et al. (2018):
$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\delta} \mathcal{L}_\theta(x+\delta, y)\right].$$

3.2. LEARNING HYPERPARAMETER ρ

A key point in the claim of Theorem 2 is to determine the hyperparameter $\rho_{a+1}$ that bounds the dependency between the last layer of subnetwork $f_a := F^{(a)}$ and the first layer of subnetwork $f_b := F^{(a+1,n)}$, as well as the hyperparameters $\rho_{a+2}, \ldots, \rho_n$ that bound the dependencies between consecutive layers in $f_b$. In the experimental results we denote these values as $\rho_n, \ldots, \rho_{a+1}$, where $\rho_n$ corresponds to the last pair of layers in $f_b$. We have devised a novel adversarial training algorithm that learns these $\rho$-values and supports the claim that subnetwork robustness guarantees network robustness.

Algorithm 1: Learning Hyperparameter ρ

Do regular and adversarial training of F^(n) as (f_a, f_b) and (f*_a, f*_b), respectively
Store the test accuracy of the adversarially trained (f*_a, f*_b) as Acc*
Set k to be as small as possible
Initialize ρ_{a+1}, ..., ρ_n = ∞, ..., ∞
for t = 1, ..., T do
    Load f_b; freeze f*_a
    for e = 1, ..., E do
        Do one epoch of adversarial training of f_b to get f̃_b
        Store the test accuracy of (f*_a, f̃_b) as Acc_t^e
        if Acc* − Acc_t^e ≤ k then break out of the epoch loop and store Acc_t^e
    end
    for j = a+1, ..., n do
        Compute I_{j,t} as given in (5) for all consecutive layers in (f*_a, f̃_b), then store I_{j,t}
    end
end
Âcc = largest Acc_t^e; ρ_j = smallest I_{j,t} for j = a+1, ..., n
Report ρ_{a+1}, ..., ρ_n and Âcc

This procedure, labeled Algorithm 1, assumes that the mutual dependency between the two parts of a network $F^{(n)}$ is measured by their MI. To retrieve baseline results, the method first performs standard ("regular") training of the whole network on the original dataset, and then repeats the training with adversarial examples of that set. The network's two halves are denoted $f_a$ and $f_b$ if regularly trained, or $f_a^*$ and $f_b^*$ if adversarially trained. In the next stage, the algorithm runs $T$ trials, each of which adversarially trains $f_b$ for up to $E$ epochs while $f_a^*$ is frozen; the second part of $F^{(n)}$ after an epoch of training is labeled $\tilde{f}_b$. Ideally, the current training accuracy $Acc_t^e$ should approach $Acc^*$ within a small value $k$, at which point the current trial ends. Next, the class-conditional MI, $I_{j,t} := \sum_y \pi(y) I(f^{(j-1)}; f^{(j)} \mid y)$, between each pair of consecutive layers from $f^{(a)}$ to $f^{(n)}$, is calculated. As the trials progress, the largest testing accuracy achieved ($\widehat{Acc}$) is updated, along with the corresponding trial's $I_{j,t}$ values ($\rho_{a+1}$ to $\rho_n$). After adversarial training ends, results are reported for the trial achieving the highest adversarial testing accuracy $\widehat{Acc}$. We provide the hyperparameter settings for Algorithm 1 in the SM.
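The control flow of Algorithm 1 can be exercised with a runnable skeleton, where stub training, evaluation, and MI routines stand in for the real adversarial training of $f_b$ and the EDGE estimates; taking the element-wise minimum of $I_{j,t}$ over trials is one reading of the $\rho$-update.

```python
import numpy as np

# Runnable skeleton of Algorithm 1.  train_epoch, layer_mi, and all
# constants are toy stand-ins: the real procedure adversarially trains
# f_b with f*_a frozen and estimates class-conditional MI with EDGE.

rng = np.random.default_rng(5)
acc_star = 0.80          # test accuracy of the fully robust (f*_a, f*_b)
k = 0.02                 # tolerance on Acc* - Acc_t^e
T, E, n_pairs = 3, 5, 4  # trials, max epochs, consecutive layer pairs

def train_epoch(t, e):                 # stub: accuracy improves with epochs
    return min(acc_star, 0.60 + 0.05 * e + 0.01 * t)

def layer_mi(t):                       # stub for the I_{j,t} of eq. (5)
    return rng.uniform(0.5, 2.0, size=n_pairs)

best_acc, rho = -np.inf, np.full(n_pairs, np.inf)
for t in range(1, T + 1):
    acc_t = None
    for e in range(1, E + 1):
        acc_t = train_epoch(t, e)      # one adversarial epoch on f_b
        if acc_star - acc_t <= k:      # close enough to the fully robust model
            break
    mi_t = layer_mi(t)                 # class-conditional MI per layer pair
    rho = np.minimum(rho, mi_t)        # rho_j = smallest I_{j,t} (one reading)
    if acc_t > best_acc:               # track the best trial's accuracy
        best_acc = acc_t

assert best_acc <= acc_star
assert np.all(np.isfinite(rho))
```

Replacing the two stubs with an actual adversarial-training step and MI estimator recovers the full procedure; the outer loop and early-stopping logic are unchanged.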

3.3. PIECE-WISE ADVERSARIAL ROBUSTNESS GUARANTEED

The experimental results support our claims in Theorem 2. The tests span the AlexNet, VGG16, and ResNet50 architectures on the CIFAR-10, CIFAR-100, and Imagenette datasets. As the network always undergoes the same procedure for standard training, the regular test accuracies are the same for all $f_b$ sizes. If Theorem 2 is correct, then despite $f_a^*$ being frozen while training $f_b$, the network should still be robust to adversarial examples due to the mutual dependencies within it. We see in Table 1 that training $\tilde{f}_b$ frequently brings the network within 1-2% of $Acc^*$ across varying combinations of networks and datasets. For this experiment, the number of trainable (e.g., convolutional or linear) layers in $f_b$ varies by network, with values of 4, 12, and 16, to ensure that $f_b$ comprises a large portion of the respective networks. For this table, all data was attacked with AutoAttack using $\epsilon = 8/255$. We report the adversarial test accuracies of the fully robust model ($Acc^*$), the semirobust network $(f_a^*, f_b)$ denoted $Acc_{sr}$, and the network $(f_a^*, \tilde{f}_b)$ denoted $\widehat{Acc}$. As shown in Fig. 1, each prior layer $f^{(n-x)}$ (where $x$ is the x-axis value) tends to show higher $\rho$ values, leveling off at a certain depth. An exception tends to occur when training $\tilde{f}_b$ fails to converge, sometimes resulting in $\rho$ values close to 0 in the early layers of $f_b$; this can be seen in Fig. 1 for ResNet50 on Imagenette. The accompanying data table in the SM reflects that this particular run of ResNet50 failed to achieve an $\widehat{Acc}$ similar to $Acc^*$. Such occurrences support the idea that a sufficient $\rho_{a+1}$ is required to achieve subnetwork robustness of $f_b$.

Effects of Dataset, Network, and Attack Type on ρ: To investigate the effects of dataset, network type, and attack type on the observed $\rho$ values, we ran a series of experiments for Algorithm 1 with certain hyperparameters held constant; these are listed in the SM along with additional analysis.
We observe that attack type and network depth lack readily apparent trends with the per-layer values of $\rho$. We do observe a clear trend in which the range of $\rho$ values obtained across the layers of $f_b$ is smallest for CIFAR-100 and largest for Imagenette.

Experimental Analysis: We observe in our experiments that changes in the dataset impact the values of $\rho$. CIFAR-100 consistently reported the lowest $\rho$ for a given layer while Imagenette reported the highest, reflecting the network's accuracy on these datasets. A likely reason is that, for a task the network has accurately learned, it displays high MI between layers to facilitate this high performance. Similarly, we show that $\rho$ tends to take higher values for deeper layers within $f_b$. This may indicate that deeper layers provide a better flow of information, which enables $f^{(a+1)}$ to readily learn to utilize the features in $f_a$. Our results indicate that when subnetwork training fails to reproduce $Acc^*$, $\rho_{a+1}$ is often $\approx 0$, indicating that the network is not properly learning to pass information from the subnetwork $f_a^*$. We report no clear trends between $\rho$ and any of the attack types or magnitudes used here. This, coupled with the frequent matching of $Acc^*$, indicates that our method is largely orthogonal to each attack type, providing comparable performance while leveraging the robustness of the first subnetwork.

An important paper that studies adversarial robustness from a theoretical perspective is Ilyas et al. (2019), who claim that adversarial examples are "features" rather than bugs. The authors state that a network's vulnerability to adversarial attacks "is a direct result of [its] sensitivity to well-generalizing features in the data". Specifically, deep neural networks learn what they call "useful, non-robust features": useful because they help a network improve its accuracy, and non-robust because they are imperceptible to humans and thus not intended to be used for classification. Consequently, a model considers robust features about as important as non-robust ones, yet adversarial examples encourage it to rely only on non-robust features. Ilyas et al. (2019) introduce a framework to explain the phenomenon of adversarial vulnerability. Rather than focusing on which features the model is learning, our method focuses on proving a probabilistic closed-form criterion to determine the minimal subnetwork that needs to be adversarially trained in order to confer full-network adversarial robustness.

4. RELATED WORK

More recently, some attention has been given to adversarially robust subnetworks through methods following the lottery-ticket concept of Frankle & Carbin (2018), including Peng et al. (2022) and Fu et al. (2021). Although these works are also interested in robust subnetworks, their focus is often more empirical, or centers on the robustness of the subnetwork itself, rather than, as we do, investigating how other subnetworks can benefit from that semirobustness. Applying the theory outlined here to such methods could provide an interesting avenue for continual learning, where robust subnetworks are sequentially identified and built up over a series of tasks by incorporating the theory behind semirobustness.

5. CONCLUSION

Discussion: We have introduced the notion of semirobustness, in which part of a network is adversarially robust. The investigation of this characteristic has interesting applications both theoretically and empirically. We prove that if a subnetwork is semirobust and its layers have high dependency with the later layers, the second subnetwork is also robust. This has been proven under both non-linear dependency (MI) and linear connectivity between the layers of the two subnetworks. As our method makes no assumptions about how the subnetwork is adversarially trained, it is expected to serve as an orthogonal complement to existing adversarial training methods. This is supported by our experimental observation that attack type had little impact on the trends seen for ρ. We additionally show through our experiments that, given a semirobust network in which fewer than half of the layers are adversarially robust (as with VGG16 when f_b contains the last 12 trainable layers), training the remaining non-robust portion for a small number of epochs can nearly reproduce the robustness of a fully robust network under the same attack. Beyond the potential for subnetwork training to be used alongside other adversarial training methods, the theory outlined here may provide tools for methods that rely on training the full network to challenge this constraint theoretically by leveraging semirobustness within their networks.

Looking ahead: One open question is how to determine the complexity of semirobust subnetwork performance in terms of convergence rate. Answering it involves investigating a bound on the performance difference as a function of the dependency between layers (ρ). In addition, although the trend observed between ρ and dataset is consistent and clear, the reason for it is less apparent.
The narrower range of ρ values in CIFAR-100 is most likely due either to the larger number of classes (100 vs. 10) or to the resulting lower predictive accuracy (which is at least in part due to the larger number of classes). Imagenette, on the other hand, has the same number of classes as CIFAR-10 but significantly larger images (224×224 vs. 32×32) and fewer samples. Further investigation of this relationship remains an interesting avenue for future work.





Figure 1: Connectivity values of layers in f b on multiple datasets at large relative sizes of f b

Figure 2: Connectivity values of ResNet50 on CIFAR-10 perturbed by AutoAttack

Table 1: Subnetwork training with AutoAttack on varying setups (columns: Model, Dataset, f_b layers, Acc*, Acc_sr, Âcc, Diff., ρ_n, ρ_{n-3}, ρ_{n-7}, ρ_{n-11}, ρ_{n-15}).

