RNAS-CL: ROBUST NEURAL ARCHITECTURE SEARCH BY CROSS-LAYER KNOWLEDGE DISTILLATION

Abstract

Deep Neural Networks are vulnerable to adversarial attacks. Neural Architecture Search (NAS), one of the driving tools of deep neural networks, demonstrates superior prediction accuracy in various machine learning applications. However, it is unclear how it performs against adversarial attacks. Given the presence of a robust teacher, it is interesting to investigate whether NAS can produce a robust neural architecture by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that improves the robustness of NAS by learning from a robust teacher through cross-layer knowledge distillation. Unlike previous knowledge distillation methods that encourage close student/teacher outputs only in the last layer, RNAS-CL automatically searches for the best teacher layer to supervise each student layer. Experimental results evidence the effectiveness of RNAS-CL and show that RNAS-CL produces small and robust neural architectures.

1. INTRODUCTION

Neural Architecture Search (NAS), one of the most promising tools for achieving state-of-the-art performance with deep neural networks in tasks such as computer vision and natural language processing, has attracted considerable attention in recent years. NAS automatically searches for a neural architecture according to user-specified criteria without human intervention, thus avoiding the time-consuming and burdensome manual design of neural architectures. Earlier studies in NAS are based on Evolutionary Algorithms (EA) (Real et al., 2017) and Reinforcement Learning (RL) (Zoph & Le, 2017; Tan et al., 2019). However, despite their performance, they are computationally expensive: achieving state-of-the-art performance on the ImageNet dataset can take more than 3,000 GPU days. Most recent studies (Liu et al., 2019; Cai et al., 2019; Wu et al., 2019; Wan et al., 2020; Nath et al., 2020) encode architectures as a weight-sharing super-net and optimize the weights using gradient descent. Architectures found by NAS exhibit two significant advantages. First, they achieve SOTA performance on various computer vision tasks. Second, they are efficient in terms of speed and size. Both advantages make NAS incredibly useful for real-world applications. However, most NAS methods are designed to optimize accuracy, parameters, or FLOPs, and it is not clear how the resulting architectures perform against adversarial attacks. In this paper, we propose RNAS-CL, a NAS method that jointly optimizes accuracy, latency, and robustness against adversarial attacks without robust training. Adversarial attacks craft adversarial samples, for example by adding small, carefully designed perturbations to a clean image, such that the model misclassifies it. It is widely accepted that deep learning models are susceptible to adversarial attacks (Szegedy et al., 2014).
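To make the threat model concrete, FGSM perturbs an input by ϵ times the sign of the input-gradient of the loss. The sketch below applies FGSM to a toy softmax-regression model in NumPy; the model, data, and ϵ value are illustrative placeholders (real attacks operate on deep networks via automatic differentiation), not an implementation from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fgsm_attack(x, y_onehot, W, eps):
    """FGSM: x_adv = x + eps * sign(grad_x of cross-entropy loss).

    For a linear model with logits z = W @ x and loss -y . log(softmax(z)),
    the input gradient has the closed form W.T @ (softmax(z) - y)."""
    p = softmax(W @ x)
    grad_x = W.T @ (p - y_onehot)
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 32))        # hypothetical 10-class linear model
x = rng.normal(size=32)              # hypothetical clean input
y = np.zeros(10)
y[3] = 1.0                           # true class
x_adv = fgsm_attack(x, y, W, eps=8 / 255)
```

Because every coordinate moves by exactly ±ϵ, the perturbation respects the ℓ∞ budget by construction, which is why FGSM is a natural single-step baseline (PGD iterates this step with projection).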
Therefore, it is critical to analyze the robustness of models against adversarial attacks. Adversarially robust models are crucial for security-sensitive applications such as self-driving cars, health care, and surveillance cameras. For example, a self-driving car might not recognize a signboard after a patch is attached to it; in a surveillance system, an unauthorized person might gain access by fooling the DNN model. Adversarial training (Goodfellow et al., 2015; Madry et al., 2018; Kannan et al., 2018; Tramèr et al., 2018; Zhang et al., 2019a) is the most standard defense mechanism against adversarial attacks. Here, models are trained on adversarial examples, which are often generated by the fast gradient sign method (FGSM) (Goodfellow et al., 2015) or projected gradient descent (PGD) (Madry et al., 2018). Other types of defense mechanisms include models trained with specialized losses or regularizations (Cissé et al., 2017; Hein & Andriushchenko, 2017; Yan et al., 2018; Pang et al., 2020), transforming inputs before feeding them to the model (Dziugaite et al., 2016; Guo et al., 2018; Xie et al., 2019), and using model ensembles (Kurakin et al., 2018; Liu et al., 2018). Orthogonal to these methods, recent research (Madry et al., 2018; Guo et al., 2020; Su et al., 2018; Xie & Yuille, 2020; Huang et al., 2021) found an intrinsic influence of network architecture on adversarial robustness. Inspired by this idea, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), to the best of our knowledge the first NAS method that uses knowledge distilled from a robust teacher model to find a robust architecture. Knowledge distillation transfers knowledge from a complex teacher model to a small student model. In standard knowledge distillation (Hinton et al., 2015), outputs from the teacher model are used as "soft labels" to train the student model. However, apart from the final teacher outputs, intermediate layers contain rich attention information.
Different intermediate layers attend to different parts of the input object (Zagoruyko & Komodakis, 2017). Hence, we ask: can a robust teacher improve the robustness of the student model by providing information about where to look, i.e., where to pay attention? The proposed RNAS-CL gives an affirmative answer to this question. In RNAS-CL, apart from learning from the output of the robust teacher model, each layer in the student learns "where to look" from the layers in the teacher model. However, the teacher and student might have different numbers of layers. This raises another question: how should a student layer be mapped to the teacher layer it learns from? In RNAS-CL, apart from searching the architecture of the student model, we search for the perfect tutor (teacher) layer for each student layer. Consider a teacher (T) and student (S) model with n_t and n_s layers, respectively, where T_i and S_i are the i-th teacher and student layers. In RNAS-CL, each student layer S_i is associated with n_t Gumbel weights, one per teacher layer. Intuitively, each Gumbel weight indicates the strength of the connection between the student layer and the corresponding teacher layer. In the search phase, besides optimizing the architectural weights, we optimize these Gumbel weights to find the perfect teacher layer. We expect the teacher to teach "where to pay attention." Therefore, by virtue of our RNAS-CL loss function for each student-teacher layer pair, each student layer learns robustness from a properly and automatically chosen teacher layer by maximizing the similarity of its attention map to that of its tutor layer.

1.1. CONTRIBUTIONS

Below are the main contributions of this work.

1. Adversarial robust NAS. RNAS-CL optimizes neural architecture to achieve a good trade-off between robustness and prediction accuracy in a differentiable manner. To the best of our knowledge, RNAS-CL is the first NAS method that optimizes robustness and prediction accuracy without robust training. Leveraging a penalty on model size/inference cost, the neural architectures found by RNAS-CL are compact compared to those of competing NAS methods. We compare RNAS-CL with other computationally efficient and robust models (Sehwag et al., 2020; Ye et al., 2019; Gui et al., 2019; Goldblum et al., 2020; Dong et al., 2020; Huang et al., 2021). Compared to these models, similar-sized RNAS-CL models achieve up to ∼10% higher clean accuracy and up to ∼5% higher PGD accuracy on the CIFAR-10 dataset.

2. Cross-Layer Knowledge Distillation. Our work advances the research on Knowledge Distillation (KD) using NAS. In particular, while conventional KD only uses fixed connections between teacher and student models to guide the student model, RNAS-CL extends the teaching scheme to learnable connections between layers of the teacher and the student models.

2. RELATED WORK

2.1. KNOWLEDGE DISTILLATION

Knowledge Distillation (KD) transfers knowledge from a large, cumbersome model to a small model. (Hinton et al., 2015) proposed the teacher-student paradigm, where soft targets from the teacher are used to train the student model. KD forces the student to generalize similarly to the teacher model. Since (Hinton et al., 2015), numerous KD variants (Romero et al., 2015; Yim et al., 2017; Zagoruyko & Komodakis, 2017; Li et al., 2019; Tian et al., 2020a; Sun et al., 2019) based on feature maps, attention maps, or contrastive learning have been proposed. (Romero et al., 2015) introduced intermediate-level hints from the teacher model to guide student training, training the student in two stages: first, the student's middle layer is trained to predict the output of the teacher's middle layer (hint layer); next, the pre-trained student is fine-tuned using the standard KD objective. Thanks to the intermediate hints, the student model achieved better performance with fewer parameters. Moving a step further, (Yim et al., 2017), (Zagoruyko & Komodakis, 2017), and (Li et al., 2019) used information from multiple teacher layers to guide student training. (Yim et al., 2017) computed a Gramian matrix between the first and last layers' output features to represent the flow of problem-solving, and transferred knowledge by minimizing the distance between the student's and teacher's flow matrices. (Li et al., 2019) calculated inter-layer and inter-class Gramian matrices to find the most representative layers and then minimized the distance between a few of the most representative student and teacher layers. (Zagoruyko & Komodakis, 2017) minimized the distance between teacher and student attention maps at various blocks. (Li et al., 2020) distilled knowledge from the teacher's blocks to supervise the student's block-wise architecture search.
Contrary to the above methods, which map only a few teacher-student layers or blocks, we map every student layer to a teacher layer. We propose RNAS-CL to search for the perfect tutor layer for each student layer. Similar to (Zagoruyko & Komodakis, 2017), we minimize the distance between the attention maps of mapped student-teacher pairs.

2.2. NEURAL ARCHITECTURE SEARCH

Neural Architecture Search (NAS) is a technique that automatically designs neural architectures without human intervention. Given a search space, one could find the best architecture by training all candidate architectures from scratch to convergence; however, this is computationally impractical. Earlier studies in NAS were based on RL (Zoph & Le, 2017; Tan et al., 2019) and EA (Real et al., 2017); however, they required substantial computational resources. Most recent studies (Liu et al., 2019; Cai et al., 2019; Wu et al., 2019) encode architectures as a weight-sharing super-network. Specifically, they train an over-parameterized network containing all candidate paths and introduce weights corresponding to each path. These weights are optimized using gradient descent to select a single network in the end, which is then trained in a standard fashion. Although these methods achieve SOTA performance on various classification tasks, their robustness against adversarial attacks is unknown. (Devaguptapu et al., 2021; Guo et al., 2020; Li et al., 2021; Madry et al., 2018; Su et al., 2018; Xie & Yuille, 2020; Huang et al., 2021) found an intrinsic influence of network architecture on adversarial robustness. (Devaguptapu et al., 2021) observed that handcrafted architectures are more robust against adversarial attacks than NAS models, and further observed empirically that increasing model size increases robustness against adversarial attacks. (Guo et al., 2020) discovered that densely connected architectures are more robust to adversarial attacks; they thus proposed a NAS method that conducts adversarial training on a super-net and then selects an architecture with dense connections. (Li et al., 2021) dilated the backbone network to preserve its standard accuracy and then optimized the architecture and parameters using adversarial training.
Despite their SOTA performance, these methods share a major drawback: adversarial training is highly time-consuming and decreases performance on standard (clean) images. This paper proposes a NAS method that optimizes robustness and prediction accuracy without adversarial training.

2.3. EFFICIENT AND ROBUST MODELS

The research community has extensively studied building efficient models and adversarially robust models individually. However, few works combine both goals into a single efficient and adversarially robust model. (Sehwag et al., 2020) proposed making the pruning technique aware of the robust training objective, formulating pruning as an empirical risk minimization (ERM) problem integrated with a robust training objective. (Huang et al., 2021) investigated the impact of network width and depth configurations on the robustness of adversarially trained DNNs and observed that reducing the capacity of the last blocks improves adversarial robustness. (Goldblum et al., 2020) proposed Adversarially Robust Distillation (ARD), which encourages student networks to mimic their teacher's output within an ϵ-ball of the training samples. Furthermore, a few NAS methods (Yue et al., 2022; Ning et al., 2020; Xie et al., 2021) jointly optimize accuracy, latency, and robustness. Compared to these methods, similar-sized RNAS-CL models achieve both higher clean and higher robust accuracy.

3. ROBUST KNOWLEDGE DISTILLATION FOR NEURAL ARCHITECTURE SEARCH

We use knowledge distilled from a robust teacher model to search for a robust and efficient architecture. Knowledge distillation is the transfer of knowledge from a large teacher model to a small student model. In standard knowledge distillation, outputs from the teacher model are used as "soft labels" to train the student model. However, apart from the final teacher outputs, intermediate features contain important attention information: different intermediate layers "attend" to different parts of the input object. In RNAS-CL, apart from learning from the teacher's soft labels, each student layer is mapped to a robust teacher layer to learn where to look, i.e., where to pay attention. In Section 3.1, we discuss how we define attention maps. We hypothesize that learning where to pay attention from a robust teacher will inherently make the student model more robust to adversarial attacks. Since the teacher and student may have different numbers of layers, the question arises of how to map a student layer to a teacher layer. In our method, we search for the perfect tutor for each layer. Furthermore, along with increasing robustness, we are also interested in searching for an efficient architecture. In Sections 3.2 and 3.3, we discuss our tutor search and architecture search algorithms. Similar to other state-of-the-art NAS methods (Liu et al., 2019; Wu et al., 2019; Wan et al., 2020), RNAS-CL consists of a searching phase and a training phase. In the search phase, we optimize the architectural weights. In the training phase, we train the architecture sampled from the search phase in a standard fashion. In Section 3.4, we discuss our searching and training optimization objectives.

3.1. ATTENTION MAP

We are interested in learning where to pay attention from a robust teacher model. Consider a convolution layer with activation tensor A ∈ R^{C×H×W}, where C is the number of channels and H and W are the spatial dimensions. We define a mapping function F : R^{C×H×W} → R^{H×W} that takes A as input and outputs an attention map F(A) ∈ R^{H×W} by

[F(A)]_{h,w} = \sum_{c=1}^{C} A_{c,h,w}^2,

where A_{c,h,w} is the element of A with channel coordinate c and spatial coordinates h and w. We use the activation-based mapping function F proposed in (Zagoruyko & Komodakis, 2017). The mapping function F is applied to the activation tensor after each convolution layer to generate an attention map. We visualize a few attention maps in Figure 2(b). RNAS-CL intends to find a teacher layer, referred to as a tutor, for each student layer such that the student layer's attention map is similar to that of its tutor in the teacher model. A student attention map may differ in dimension from that of its tutor; to address this, we interpolate all attention maps to a common dimension.
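The mapping above can be sketched in a few lines of NumPy. This is a minimal illustration of the channel-wise squared sum and the ℓ2-normalization used later in the attention loss; the activation tensor and its shape are hypothetical, and the common-dimension interpolation step is omitted for brevity:

```python
import numpy as np

def attention_map(A):
    """Activation-based attention map F(A): sum of squared activations
    over the channel axis. A has shape (C, H, W); output has shape (H, W)."""
    return (A ** 2).sum(axis=0)

def normalized_attention(A):
    """Flattened, l2-normalized attention map, the form compared
    between student and teacher layers in the attention loss."""
    F = attention_map(A).ravel()
    return F / (np.linalg.norm(F) + 1e-12)  # small constant avoids division by zero

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 8, 8))  # hypothetical activation tensor (C=64, H=W=8)
F = attention_map(A)
v = normalized_attention(A)
```

Since each entry of F(A) is a sum of squares, the map is non-negative, and normalizing it removes the scale difference between student and teacher activations before they are compared.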

3.2. TUTOR SEARCH

As described above, we aim to find a tutor (teacher layer) for each student layer that teaches where to pay attention. However, each student layer can choose any tutor, resulting in an exponentially large search space. For example, the search space for a student model with 20 layers and a teacher model with 50 layers has size 50^20. To address this computational issue, we employ Gumbel-Softmax (Jang et al., 2017) to search for the tutor of each student layer in a differentiable manner. Given network parameters v = [v_1, ..., v_n] and a temperature constant τ, the Gumbel-Softmax function is defined as g(v) = [g_1, ..., g_n], where

g_i = \frac{\exp[(v_i + ϵ_i)/τ]}{\sum_j \exp[(v_j + ϵ_j)/τ]},

and ϵ_i = -\log(-\log(u_i)) with u_i ∼ Uniform(0, 1) is i.i.d. Gumbel noise. When τ → 0, Gumbel-Softmax tends to the argmax function; it is a "re-parametrization trick" that can be regarded as a differentiable approximation to argmax. Now consider a teacher model T and a student model S with n_t and n_s layers, respectively, and let A_t^j and A_s^i be the activation tensors of the j-th teacher and i-th student layers. In RNAS-CL, each student layer i is associated with n_t Gumbel weights g_i ∈ R^{1×n_t}. Let g_{ij} be the Gumbel weight connecting the i-th student and j-th teacher layer. The attention loss is then defined as

L_Attn(A_t, A_s) = \frac{1}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} g_{ij} \left\| \frac{F(A_s^i)}{\|F(A_s^i)\|_2} - \frac{F(A_t^j)}{\|F(A_t^j)\|_2} \right\|_2^2,  (1)

where A_s and A_t are the activation tensors of all student and teacher convolution layers, F is the mapping function defined in Section 3.1, and ∥·∥_2 is the ℓ_2-norm. We exponentially decay the temperature τ of the Gumbel-Softmax during the search, leading to an encoding close to a one-hot vector.
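The tutor search can be sketched as follows. The snippet samples per-student-layer Gumbel-Softmax weights over teacher layers and evaluates the attention loss (1) over pre-flattened attention maps; layer counts, map size, and logits are hypothetical placeholders, and real use would backpropagate through the relaxation rather than loop in NumPy:

```python
import numpy as np

def gumbel_softmax(v, tau, rng):
    """Differentiable relaxation of argmax: softmax over (v + Gumbel noise) / tau."""
    u = rng.uniform(1e-9, 1.0, size=v.shape)
    eps = -np.log(-np.log(u))            # standard Gumbel(0, 1) noise
    z = (v + eps) / tau
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_loss(F_s, F_t, G):
    """L_Attn: Gumbel-weighted squared distance between l2-normalized
    student (n_s x D) and teacher (n_t x D) attention maps; G is (n_s, n_t)."""
    S = F_s / np.linalg.norm(F_s, axis=1, keepdims=True)
    T = F_t / np.linalg.norm(F_t, axis=1, keepdims=True)
    n_s, n_t = G.shape
    total = 0.0
    for i in range(n_s):
        for j in range(n_t):
            total += G[i, j] * np.sum((S[i] - T[j]) ** 2)
    return total / (n_s * n_t)

rng = np.random.default_rng(0)
n_s, n_t, D = 4, 6, 64                   # hypothetical layer counts, map size
v = rng.normal(size=(n_s, n_t))          # learnable tutor-selection logits
G = np.stack([gumbel_softmax(v[i], tau=0.5, rng=rng) for i in range(n_s)])
F_s = rng.normal(size=(n_s, D))          # placeholder student attention maps
F_t = rng.normal(size=(n_t, D))          # placeholder teacher attention maps
loss = attention_loss(F_s, F_t, G)
```

Each row of G sums to one, and as τ decays toward zero the rows approach one-hot vectors, at which point each student layer is effectively paired with the single tutor j* = argmax_j g_ij.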

3.3. ARCHITECTURE SEARCH

Apart from searching the tutor for each layer, we are interested in building an efficient architecture with low latency. Inspired by FBNetV2 (Wan et al., 2020), we search for the optimal number of filters, i.e., the number of output channels, for each convolution block. Let {f_1, f_2, ..., f_n} be the filter-count choices and {z_1, z_2, ..., z_n} their corresponding outputs for a convolution block. The cumulative output is then defined as

Z = \sum_{i=1}^{n} g_w^{(i)} z_i,

where g_w^{(i)} is the Gumbel weight corresponding to the i-th filter choice. The number of FLOPs is optimized so as to ensure low latency: the FLOPs are proportional to the number of filters, and the cumulative number of filters is a function of the Gumbel weights, so the FLOPs can be optimized in a differentiable manner using SGD. As in the tutor search, the temperature is exponentially decayed to obtain an encoding close to a one-hot vector. Figure 12 in the appendix illustrates the architecture search process of FBNetV2.
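A minimal sketch of this channel search, in the spirit of FBNetV2's channel masking: each filter-count choice reuses the output of the widest convolution with the extra channels zeroed out, and the choices are combined by their Gumbel weights. The choices, weights, and tensor shapes below are hypothetical, and the convolution itself is replaced by a random tensor:

```python
import numpy as np

def channel_mask(n_max, n_active):
    """Binary mask keeping the first n_active of n_max output channels."""
    m = np.zeros(n_max)
    m[:n_active] = 1.0
    return m

def mixed_channel_output(z, filter_choices, g_w):
    """Weighted channel-masked sum: Z = sum_i g_w[i] * mask_i * z.

    z: (n_max, H, W) output of the widest convolution; each choice f_i
    shares z's weights and simply zeroes the channels beyond f_i."""
    n_max = z.shape[0]
    Z = np.zeros_like(z)
    for g, f in zip(g_w, filter_choices):
        Z += g * channel_mask(n_max, f)[:, None, None] * z
    return Z

rng = np.random.default_rng(0)
choices = [8, 16, 24, 32]                 # hypothetical filter-count choices
g_w = np.array([0.1, 0.2, 0.3, 0.4])      # Gumbel weights (sum to 1)
z = rng.normal(size=(32, 4, 4))           # placeholder widest-conv output
Z = mixed_channel_output(z, choices, g_w)
# the expected filter count (hence FLOPs) is a Gumbel-weighted sum over choices,
# so it is differentiable w.r.t. the Gumbel weights
expected_filters = float(np.dot(g_w, choices))
```

Note that low channel indices are covered by every choice and so keep their full activation, while the highest channels survive only with the weight of the widest choice; this is what lets the search smoothly shrink a block before the final hard selection.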

3.4. RNAS-CL LOSS

Following the convention of state-of-the-art NAS methods (Liu et al., 2019; Wu et al., 2019; Wan et al., 2020), RNAS-CL has a searching phase and a training phase. In the search phase, the Gumbel weights and the other model parameters are updated at each epoch of SGD, where the Gumbel weights correspond to the intermediate student-teacher connections (Section 3.2) and the filter choices (Section 3.3). The weights are optimized using the RNAS-CL search loss defined in (2).

RNAS-CL search loss. Let y be the ground-truth one-hot encoded vector, p and q be the output probabilities of the student and teacher networks, and A_s, A_t be the activation tensors of all student and teacher convolution layers. The RNAS-CL search loss is defined as

L(y, p, q, A_t, A_s) = (L_CE(y, p) + KL(p, q) + γ_s L_Attn(A_t, A_s)) · n_f,  (2)

where L_CE(y, p) = -\sum_i y_i \log p_i is the cross-entropy, KL(p, q) = \sum_i p_i \log(p_i / q_i) is the Kullback-Leibler (KL) divergence between two probability measures, L_Attn is the attention loss defined in (1), and γ_s is a normalization constant. n_f represents latency, which is optimized in a differentiable manner following (Wan et al., 2020). After the search phase, the tutor of each student layer i is selected as the j*-th teacher layer with j* = arg max_j g_{ij}. In addition, the filter choice described in Section 3.3 for each convolution block is the one corresponding to the maximum Gumbel weight. We then start the training phase, where the searched architecture is trained using the RNAS-CL train loss defined below.

RNAS-CL train loss. With y, p, q, A_t, and A_s as above, the RNAS-CL train loss is

L(y, p, q, A_t, A_s) = L_CE(y, p) + KL(p, q) + γ_t L_Attn(A_t, A_s),  (3)

where γ_t is a normalization constant. Note that g_i in L_Attn is now a one-hot vector, so each student attention map is optimized w.r.t. a single tutor layer.
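The two objectives can be sketched directly from their definitions. The probability vectors and the attention-loss value below are hypothetical placeholders standing in for network outputs; the point is only the composition of the terms in (2) and (3):

```python
import numpy as np

def kl_div(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return float(np.sum(p * np.log(p / q)))

def rnas_cl_search_loss(y, p, q, l_attn, gamma_s, n_f):
    """Search loss (2): (CE + KL + gamma_s * L_Attn) scaled by the latency term n_f."""
    ce = -float(np.sum(y * np.log(p)))
    return (ce + kl_div(p, q) + gamma_s * l_attn) * n_f

def rnas_cl_train_loss(y, p, q, l_attn, gamma_t):
    """Train loss (3): CE + KL + gamma_t * L_Attn (tutors now fixed to one-hot)."""
    ce = -float(np.sum(y * np.log(p)))
    return ce + kl_div(p, q) + gamma_t * l_attn

y = np.array([0.0, 1.0, 0.0])      # ground-truth one-hot vector
p = np.array([0.2, 0.7, 0.1])      # hypothetical student probabilities
q = np.array([0.1, 0.8, 0.1])      # hypothetical robust-teacher probabilities
search = rnas_cl_search_loss(y, p, q, l_attn=0.5, gamma_s=1.0, n_f=1.2)
train = rnas_cl_train_loss(y, p, q, l_attn=0.5, gamma_t=1.0)
```

The only structural difference between the two phases is the multiplicative latency factor n_f in the search loss, which pushes the search toward cheaper architectures while the same distillation terms are kept.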

4. EXPERIMENTS

In this section, we conduct experiments on real-world datasets to show the effectiveness of the proposed framework. The experiments section is organized as follows. In Section 4.1, we discuss our experimental setup and implementation details. In Section 4.2, we compare models trained by RNAS-CL against state-of-the-art efficient and robust models on CIFAR-10. In Section 4.3, we empirically show the effectiveness of cross-connections in improving the adversarial robustness of the model. We further discuss the robustness-inducing capacity of teacher layers and compare RNAS-CL models trained on ImageNet-100 in the appendix.

4.1. IMPLEMENTATION DETAILS

In this paper, we evaluate RNAS-CL on two public image-classification benchmarks: (1) CIFAR-10, a collection of 60k images in 10 classes (Krizhevsky, 2009); and (2) ImageNet-100, a subset of the ImageNet-1k dataset (Russakovsky et al., 2015) with 100 classes and about 130k images (Tian et al., 2020c). We use standard data augmentation techniques for each dataset, such as random-resize cropping and random flipping. We train different architectures found by RNAS-CL on both CIFAR-10 and ImageNet-100. On each dataset, we first perform the searching step, training the model with the RNAS-CL search loss (2). We search for the channel number and the connected teacher layer at each student layer. We conduct experiments with different search spaces and various robust teacher models. In this section, we refer to our models as RNAS-CL-X-T, where X represents the search space and T represents the robust teacher model. The detailed search spaces are provided in Table 7. For both datasets, we use the SGD optimizer. For ImageNet-100, the default values of momentum and weight decay are set to 0.9 and 4e-5, respectively, and the batch size is set to 256. The learning rate is initialized to 0.05 and annealed down to zero following a cosine schedule. After the search stage, which takes 100 epochs, the searched architecture is trained from scratch using the RNAS-CL train loss (3) for 200 epochs. For CIFAR-10, the default values of momentum and weight decay are set to 0.9 and 2e-4, respectively, and the batch size is set to 128. We train for 100 epochs in both the searching and training phases; the learning rate is initialized to 0.1 and reduced by a factor of 10 after the 75th and 90th epochs. Following the settings of FBNetV2, the temperature τ in Gumbel-Softmax is initialized to 5.0 and exponentially annealed by a factor of e^{-0.045} every epoch in the search phase. The hyper-parameters γ_s and γ_t in (2) and (3) are selected from the candidate set {0.01, 0.1, 1.0, 10, 100}.
Both γ_s and γ_t are set to 1.0 for all experiments. In the search phase, for each batch we use 80% of the data to optimize the model weights and the remaining 20% to optimize the architectural weights, i.e., the Gumbel weights. For robustness evaluation, we choose five powerful attacks: FGSM (Goodfellow et al., 2015), MI-FGSM (Dong et al., 2018), PGD (Madry et al., 2018), CW (Carlini & Wagner, 2017), and AutoAttack (Croce & Hein, 2020). Results for CW and AutoAttack are provided in Appendix A.4. Consistent with the adversarial literature (Madry et al., 2018; Zhang et al., 2019b), adversarial perturbations are considered under the ℓ_∞ norm with a total perturbation scale of 8/255 (0.031).
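The two schedules stated above can be written out explicitly. This sketch reproduces the exponential temperature annealing and the cosine learning-rate schedule from the hyper-parameters reported in this section (initial values and epoch counts are the paper's; the function names are ours):

```python
import numpy as np

def gumbel_temperature(epoch, tau0=5.0, decay=0.045):
    """Gumbel-Softmax temperature, annealed by a factor of e^{-decay} per epoch:
    tau(epoch) = tau0 * exp(-decay * epoch)."""
    return tau0 * np.exp(-decay * epoch)

def cosine_lr(epoch, total_epochs, lr0=0.05):
    """Cosine schedule annealing the learning rate from lr0 down to zero."""
    return 0.5 * lr0 * (1.0 + np.cos(np.pi * epoch / total_epochs))

# 100 search epochs, as in the ImageNet-100 setup
taus = [gumbel_temperature(e) for e in range(100)]
lrs = [cosine_lr(e, 100) for e in range(100)]
```

By the end of the 100-epoch search the temperature has dropped from 5.0 to roughly 5·e^{-4.5} ≈ 0.06, so the Gumbel-Softmax outputs are close to one-hot and the final hard selection changes the network very little.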

4.2. COMPARE EFFICIENT AND ROBUST CIFAR-10 MODELS

In this section, we compare the robustness of our method against other SOTA efficient and robust models. In Table 1, we compare RNAS-CL to efficient models trained both with and without adversarial training. All RNAS-CL models are trained with robust WideResNet-34 (Rice et al., 2020) as the teacher model. RNAS-CL significantly outperforms all models trained without adversarial training in terms of adversarial accuracy while being significantly smaller. For example, RNAS-CL-S7-WRT-34 achieves more than 28% higher PGD accuracy than most of the other methods. Compared to MVVV2-ARD, RNAS-CL-S7-WRT-34 achieves ∼1% lower PGD accuracy; however, it exceeds MVVV2-ARD by 14.5% in clean accuracy while being 10× smaller, and a similar-sized model, RNAS-CL-M-WRT-34, exceeds it in both clean and PGD accuracy, by 16.5% and 1.43%, respectively. Next, we compare RNAS-CL against adversarially trained robust models. For a fair comparison, after the training stage we train our RNAS-CL models with the TRADES optimization objective for 20 epochs; for this retraining, the cross-entropy term in (3) is replaced by the TRADES objective. Adversarial training improves the adversarial accuracy of RNAS-CL models: they achieve similar or higher adversarial accuracy than other adversarially trained models while being much smaller and achieving significantly higher clean accuracy. For example, in Table 1, RNAS-CL-M-WRT-34 achieves similar or higher adversarial accuracy than most other methods while being smaller and significantly exceeding them in clean accuracy. We also obtain much smaller models using RNAS-CL: tiny RNAS-CL models exceed their counterparts by more than ∼12% in clean accuracy. For example, RNAS-CL-S5-WRT-34 exceeds HYDRA (ResNet-34) by 12.95% while achieving similar adversarial accuracy.
Similar results can also be visualized in Figure 1 . In Figure 1 , RNAS-CL models are on the top right corner of the plot, representing the models with the highest clean and adversarial accuracy. Results for RNAS-CL models trained with different robust teachers have been added to the appendix A.2.

Comparison against various perturbation budget

To further illustrate the effectiveness of RNAS-CL, we compare it with previously proposed defense mechanisms under various perturbation budgets. In Figure 3, we compare various methods against PGD and FGSM attacks. For both attacks, RNAS-CL outperforms its counterparts at all perturbation sizes, and its margin grows as the perturbation size increases. For ϵ = 0.1, RNAS-CL exceeds the other methods by ∼20% for both PGD and FGSM attacks.

4.3. ABLATION STUDY

This ablation study demonstrates the significance of student-teacher cross-layer connections in RNAS-CL. We compare four training paradigms. In the first, we conduct searching and training using the cross-entropy loss without any teacher model; we refer to this as standard. In the second, we conduct searching and training by minimizing the cross-entropy loss and the standard KL divergence with a robust teacher model; we refer to these models as KL-X-T, where X represents the search space and T the robust teacher model. In the third, we search and train using the cross-entropy loss and intermediate cross-connections (ICC); we refer to these as ICC-X-T. The fourth paradigm is RNAS-CL, which includes all three terms: cross-entropy loss, KL divergence, and cross-layer student-teacher connections. In Figure 4(a), we compare attention maps from student models trained using RNAS-CL-I-R-50 against students trained using KL-I-R-50, at various convolution layers taken at regular intervals. As expected, adding cross-layer connections brings the student model's attention maps closer to the teacher's: each student layer learns where to pay attention from its connected teacher layer. For example, in column (b), the KL-I-R-50 layer attends to various parts of the image, whereas the RNAS-CL layer, learning from the 28th teacher layer, pays more attention

A APPENDIX

A.1 ROBUST TEACHER LAYERS

In this section, we discuss the robustness-inducing capacity of teacher layers. We hypothesize that a few teacher layers are more robust than others and thus should induce more robustness in the student models. In RNAS-CL, each student layer is associated with a teacher layer. Figures 6 and 8 plot the number of student layers connected to each robust teacher layer on the CIFAR-10 and ImageNet-100 datasets. For all student models on CIFAR-10, we observe that layers 15 and 21 of the robust teacher model have significantly more intermediate connections with the student models. Similarly, for ImageNet-100, layers 18, 32, and 40 are among the dominant robust layers. In Figures 7 and 9, we visualize the most robust teacher layers on CIFAR-10 and ImageNet, respectively.

Figure 6: Illustrations of the number of student layers connected to each teacher layer in RNAS-CL for various student models on the CIFAR-10 dataset. We choose adversarially trained WideResNet-34 as the robust teacher model for all four student models, with one plot for each student model. All student architectures are described in Table 7.

1. All MACs were calculated without special hardware (Han et al., 2016) or special software (Park et al., 2017).

[...] objective. With such training, RNAS-CL achieves similar or higher adversarial accuracy while significantly outperforming Hydra and LWM in clean accuracy with only a fraction of the MACs. We further study adversarial accuracy at various perturbation budgets for three different teacher models. As illustrated in Figure 10, RNAS-CL exceeds its counterparts in adversarial accuracy at various perturbation budgets for all teacher models on the ImageNet-100 dataset. This demonstrates the significance of cross-layer connections in RNAS-CL.
A.4 COMPARE CIFAR-10 MODELS AGAINST CW AND AUTOATTACK

In this section, we compare RNAS-CL and (Huang et al., 2021) against recent attacks, namely CW∞ (Carlini & Wagner, 2017) and AutoAttack (Croce & Hein, 2020), on the CIFAR-10 dataset. CW attacks were proposed to defeat defensive distillation. In Table 4, we use the ℓ∞ version of the CW attack optimized by PGD, with the maximum perturbation budget set to ϵ = 8/255. AutoAttack is a parameter-free ensemble attack currently considered one of the most reliable and widely acknowledged evaluation benchmarks for adversarial defenses.

Table 4: Comparison of (Huang et al., 2021) and RNAS-CL against CW∞ (Carlini & Wagner, 2017) and AutoAttack (Croce & Hein, 2020) on the CIFAR-10 dataset. (Recoverable entries from the extracted table: VGG-R (Huang et al., 2021): CW∞ 46.49, AA 38.44.)



Figure 1: The figure compares various SOTA efficient and robust methods on CIFAR-10. Clean Accuracy represents top-1 accuracy on clean images. Adversarial Accuracy represents top-1 accuracy on images perturbed by a PGD attack. A larger marker size indicates a larger architecture. The numbers in brackets represent the number of parameters and MACs, respectively.

Figure 2: (a) Training paradigm of RNAS-CL. We connect attention maps from each student layer to each robust teacher layer. For each student layer, we search for the optimal teacher layer. g_ij represents the Gumbel weight of the connection between the i-th student layer and the j-th teacher layer. RNAS-CL induces robustness in the student model by searching for the optimal teacher layer. We also search for the number of filters in each layer to build an efficient model, inspired by FBNetV2 (Wan et al., 2020). (b) Sample attention maps corresponding to the input image (i) from low-level (ii), mid-level (iii), and high-level (iv) convolution layers.
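The cross-layer supervision in (a) can be viewed as a Gumbel-softmax weighted attention-transfer loss. The following is a minimal NumPy sketch under simplifying assumptions (attention maps taken as channel-wise squared sums, a single student layer, soft selection over all teacher layers at matching spatial resolution; all names are illustrative):

```python
import numpy as np

def attention_map(feat):
    # feat: (C, H, W) -> flattened, L2-normalized spatial attention map.
    a = (feat ** 2).sum(axis=0).ravel()
    return a / (np.linalg.norm(a) + 1e-12)

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Differentiable soft selection over teacher layers.
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

def cross_layer_at_loss(student_feat, teacher_feats, logits, tau=1.0):
    # Each weight g_ij softly picks the teacher layer that supervises
    # this student layer; the loss pulls the student attention map
    # toward the selected teacher attention maps.
    w = gumbel_softmax(logits, tau)
    a_s = attention_map(student_feat)
    return float(sum(w_j * np.sum((a_s - attention_map(t)) ** 2)
                     for w_j, t in zip(w, teacher_feats)))
```

During search, the `logits` would be trained jointly with the student weights, so the argmax teacher layer per student layer emerges from optimization.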

Figure 3: Robustness evaluation under different perturbation sizes for PGD and FGSM attacks.

Figure 4: (a) KL-I-R-50 represents attention maps from a model trained using cross-entropy loss and knowledge distillation without any cross-layer connections. Teacher and RNAS-CL represent attention maps from the robust teacher (ResNet-50) and the RNAS-CL model. The name of each RNAS-CL layer includes its connected teacher layer; for example, in "0th layer (13)", 13 denotes the corresponding teacher layer. RNAS-CL drives attention maps from student layers closer to those of their corresponding teacher layers. (b) Illustrations of the number of student layers connected to each teacher layer in RNAS-CL for various student models on the CIFAR-10 dataset.

Figure 5: Adversarial accuracy of various models at various perturbation budgets.

central part of the image. Similarly, in column (c), the RNAS-CL layer learns from the teacher model to pay more attention to the central and upper portions of the image. In Figure 5, we compare RNAS-CL models against KL-X-T and standard models under PGD attacks at various perturbation budgets on the CIFAR-10 dataset. The RNAS-CL and ICC models outperform their counterparts, demonstrating the significance of cross-connections.

Robust teacher layers. We hypothesize that a few teacher layers are more robust than others and should therefore induce more robustness in the student models. In RNAS-CL, each student layer is associated with a teacher layer. Figure 4(b) illustrates the number of student layers connected to each robust teacher layer on the CIFAR-10 dataset. For both student models, we observe that layers 15 and 21 of the robust teacher model have significantly more intermediate connections, suggesting that a few teacher layers have more robustness-inducing capacity than others. Plots for other student models are provided in Appendix A.1.
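The perturbation-budget sweeps in Figures 3 and 5 use standard L∞ PGD. Below is a minimal sketch of the attack on a toy linear softmax classifier, NumPy only; the paper evaluates full networks, so this is meant only to show the ascent-step and projection structure, with illustrative parameter names:

```python
import numpy as np

def pgd_linf(x, y, W, b, eps=8 / 255, alpha=2 / 255, steps=20, rng=None):
    # L-infinity PGD on a linear softmax classifier f(x) = W @ x + b.
    rng = rng or np.random.default_rng(0)
    x_adv = np.clip(x + rng.uniform(-eps, eps, x.shape), 0.0, 1.0)
    onehot = np.eye(len(b))[y]
    for _ in range(steps):
        logits = W @ x_adv + b
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = W.T @ (p - onehot)                 # d cross-entropy / d x
        x_adv = x_adv + alpha * np.sign(grad)     # gradient-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv
```

Sweeping `eps` over a grid and measuring accuracy on the resulting `x_adv` reproduces the kind of budget-vs-accuracy curves shown in the figures.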

Figure 7: Attention maps for the most robust teacher layers on the CIFAR-10 dataset. We chose the same robust teacher model as in Figure 6. The illustrated layers represent the teacher layers with the maximum number of intermediate connections for various RNAS-CL models (as described in Figure 6).

Figure 10: Adversarial accuracy of various models at various perturbation budgets on the ImageNet-100 dataset.

COMPARISON AGAINST KD VARIANTS In this section, we compare our method against various knowledge distillation methods (Park et al., 2019; Ahn et al., 2019; Tung & Mori, 2019; Tian et al., 2020b; Passalis & Tefas, 2018). We use the robust WRT-34 as the teacher model for all KD methods and train three different student architectures: RNAS-CL-S3, RNAS-CL-S5, and RNAS-CL-S7. In Figure 11, models trained using our paradigm lie clearly in the upper-right part of the graph. The RNAS-CL-S3 architecture trained using RKD performs similarly to the model trained using our method. Apart from this, all models trained using RNAS-CL significantly outperform all other methods in terms of clean and adversarial accuracy.

Figure 11: The figure compares various knowledge distillation variants (Similarity (Tung & Mori, 2019), VID (Ahn et al., 2019), RKD (Park et al., 2019), CRD (Tian et al., 2020b), PKD (Passalis & Tefas, 2018)) against RNAS-CL on the CIFAR-10 dataset. Adversarial Accuracy represents top-1 accuracy on images perturbed by a 20-step PGD attack. Clean Accuracy represents top-1 accuracy on clean images. A larger marker size indicates a larger architecture. For each method, RNAS-CL-S3, RNAS-CL-S5, and RNAS-CL-S7 are represented by increasing marker size.

Figure 12: Illustration of searching for the neural architecture of each layer of the student model using the searching mechanism in FBNetV2. g_w^i represents the Gumbel weight associated with each mask.

The table shows the performance of various efficient and robust methods on the CIFAR-10 dataset. Clean Acc represents top-1 accuracy on clean images. FGSM, PGD20, and MI-FGSM represent top-1 accuracy on images perturbed by the corresponding attacks; PGD20 denotes a 20-step PGD attack.

We show that models obtained by RNAS-CL outperform all models obtained without robust training in terms of adversarial robustness. We show that adding adversarial training can further increase the adversarial robustness of RNAS-CL models. After robust training, RNAS-CL achieves similar adversarial robustness compared to models obtained via robust training while outperforming them in terms of clean accuracy. For future work, we plan to incorporate robust training in the searching stage to further increase the robustness of the model.

Performance of RNAS-CL method trained with various robust teacher models on the CIFAR-10 dataset. Standard represents models searched and trained by cross-entropy loss without any teacher model.




The table describes the search space for ImageNet-100. Similar to Table 7, Depth represents the depth of each stage. For ImageNet-100, we have up to 5 stages. Stage 1 through Stage 5 represent the filter choices for their respective stages. For example, in Stage 1, for each convolution block, we search for its channel count among 4 output-channel options (28, 24, 20, 16).

FUNDING

4open.science/r/RNAS-CL-06A0/.

ANNEX

Figure 8: Illustrations of the number of student layers connected to each teacher layer in RNAS-CL for various student models on the ImageNet-100 dataset. We choose adversarially trained Wide-ResNet-50 as the robust teacher for all three student models, with one plot for each student model. All RNAS-CL architectures are described in Table 8.

A.2 MORE RESULTS ON CIFAR-100

In this section, we conduct experiments using adversarially trained WRT-34 (Rice et al., 2020), ResNet-50 (Engstrom et al., 2019), and ResNet-18 (Sehwag et al., 2021) as the robust teacher models on the CIFAR-10 dataset. All RNAS-CL models, while achieving similar clean accuracy, exceed their counterparts by more than 10% in PGD accuracy. RNAS-CL-R50 achieves higher robust accuracy than RNAS-CL-R18 and RNAS-CL-WRT-34, even though ResNet-50 has the lowest PGD accuracy among the teacher models, suggesting that the teacher's architecture has more influence on the student's performance than the teacher's own performance. A higher number of teacher layers gives the student layers more options to learn from, leading to better robustness. The teacher models' performance is reported in Table 6.

A.3 COMPARE EFFICIENT AND ROBUST IMAGENET-100 MODELS

We compare RNAS-CL to adversarially robust pruning methods on the ImageNet-100 dataset, with results shown in Table 3. RNAS-CL models are trained with three different robust teachers, ResNet-18, ResNet-50, and WideResNet-50, each adversarially pre-trained on ImageNet (Engstrom et al., 2019). We observe that RNAS-CL models consistently exceed other models by ∼25% in terms of clean accuracy while exhibiting adversarial robustness. In Table 3, both Hydra and LWM were adversarially trained using TRADES (Zhang et al., 2019a). For a fair comparison, after the regular training stage without TRADES, we retrain our RNAS-CL models with the TRADES optimization objective: we replace the cross-entropy term in (3) with the TRADES objective. Both models are further adversarially trained using Fast-AT (Wong et al., 2020). With such training, RNAS-CL achieves similar or higher adversarial accuracy while significantly outperforming Hydra and LWM in clean accuracy with only a fraction of the MACs. All MACs were calculated without special hardware (Han et al., 2016) or special software (Park et al., 2017). We further study adversarial accuracy at various perturbation budgets for three different teacher models. As illustrated in Figure 10, RNAS-CL exceeds its counterparts in adversarial accuracy at various perturbation budgets for all teacher models on the ImageNet-100 dataset, demonstrating the significance of cross-layer connections in RNAS-CL.
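The TRADES retraining above swaps the cross-entropy term for the TRADES objective, i.e. clean cross-entropy plus a weighted KL term between clean and adversarial predictions. A minimal single-example sketch follows; β and the logits here are illustrative, and Zhang et al. (2019a) define the full mini-batch objective with an inner maximization over the adversarial example:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def trades_loss(logits_clean, logits_adv, y, beta=6.0):
    # CE(f(x), y) + beta * KL(f(x) || f(x')) for a single example,
    # following the form of the TRADES objective (Zhang et al., 2019a).
    p, q = softmax(logits_clean), softmax(logits_adv)
    ce = -np.log(p[y] + 1e-12)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
    return float(ce + beta * kl)
```

With identical clean and adversarial logits the KL term vanishes and the loss reduces to plain cross-entropy, which is why the objective trades off robustness against clean accuracy through β.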

A.8 ARCHITECTURE

In this section, we discuss the architectures of the various proposed super-nets used in RNAS-CL for the CIFAR-10 and ImageNet-100 datasets. Table 7 describes the super-nets used for CIFAR-10; we use super-nets with three blocks. The super-nets used for ImageNet-100 are described in Table 8; for ImageNet-100, the number of blocks varies from 3 to 5.

Table 7: The table describes the search space for CIFAR-10. Depth represents the depth of each stage. For example, 3-3-3 represents three convolution blocks in each stage. All search spaces have three stages. Stage 1, Stage 2, and Stage 3 represent the filter choices for their respective stages. For example, at Stage 3 of RNAS-CL-S3, for each convolution block, we search among 4 output-channel options (64, 60, 56, 52).

A.9 ARCHITECTURE SEARCH BY FBNETV2

RNAS-CL builds an efficient and adversarially robust deep learning model. In this work, we use the training paradigm of FBNetV2 to search for efficient models. In Figure 12, we illustrate the search process for the neural architecture at a single convolution layer. Each filter choice is attached to a Gumbel weight; these Gumbel weights are optimized to select an efficient model.
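The mask-based channel search in Figure 12 can be sketched as a Gumbel-weighted sum of binary channel masks applied to one convolution output. This is a NumPy sketch of the FBNetV2-style mechanism; `channel_options` and all names are illustrative, not the exact implementation:

```python
import numpy as np

def masked_channel_search(feat, mask_logits, channel_options, tau=1.0, rng=None):
    # feat: (C_max, H, W) output of a conv layer built with the maximum
    # channel count. Each option keeps the first k channels; the
    # Gumbel-softmax weights g_w^i mix the masked outputs, so the
    # effective channel count stays differentiable during search.
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(size=mask_logits.shape) + 1e-20) + 1e-20)
    y = (mask_logits + g) / tau
    w = np.exp(y - y.max())
    w /= w.sum()
    out = np.zeros_like(feat)
    for w_i, k in zip(w, channel_options):
        mask = np.zeros((feat.shape[0], 1, 1))
        mask[:k] = 1.0
        out += w_i * feat * mask
    return out, w
```

After training, the option with the largest weight determines the layer's final channel count, which is how the search selects an efficient architecture.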

