THE DARK SIDE OF AUTOML: TOWARDS ARCHITECTURAL BACKDOOR SEARCH

Abstract

This paper asks the intriguing question: is it possible to exploit neural architecture search (NAS) as a new attack vector to launch previously improbable attacks? Specifically, we present EVAS, a new attack that leverages NAS to find neural architectures with inherent backdoors and exploits such vulnerability using input-aware triggers. Compared with existing attacks, EVAS demonstrates many interesting properties: (i) it does not require polluting training data or perturbing model parameters; (ii) it is agnostic to downstream fine-tuning or even re-training from scratch; (iii) it naturally evades defenses that rely on inspecting model parameters or training data. With extensive evaluation on benchmark datasets, we show that EVAS features high evasiveness, transferability, and robustness, thereby expanding the adversary's design spectrum. We further characterize the mechanisms underlying EVAS, which are possibly explainable by architecture-level "shortcuts" that recognize trigger patterns. This work showcases that NAS can be exploited in a harmful way to find architectures with inherent backdoor vulnerability. The code is available at https://github.com/ain-soph/nas_backdoor.

1. INTRODUCTION

As a new paradigm of applying ML techniques in practice, automated machine learning (AutoML) automates the pipeline from raw data to deployable models, covering model design, optimizer selection, and parameter tuning. The use of AutoML greatly simplifies the ML development cycle and propels the trend of ML democratization. In particular, neural architecture search (NAS), one primary AutoML task, aims to find performant deep neural network (DNN) arches tailored to given datasets. In many cases, NAS is shown to find models remarkably outperforming manually designed ones (Pham et al., 2018; Liu et al., 2019; Li et al., 2020). In contrast to the intensive research on improving the capability of NAS, its security implications are largely unexplored. As ML models become the new targets of malicious attacks (Biggio & Roli, 2018), the lack of understanding about the risks of NAS is highly concerning, given its surging popularity in security-sensitive domains (Pang et al., 2022). Towards bridging this striking gap, we pose the intriguing yet critical question: Is it possible for the adversary to exploit NAS to launch previously improbable attacks? This work provides an affirmative answer to this question. We present exploitable and vulnerable arch search (EVAS), a new backdoor attack that leverages NAS to find neural arches with inherent, exploitable vulnerability. Conventional backdoor attacks typically embed the malicious functions ("backdoors") into the space of model parameters. They often assume strong threat models, such as polluting training data (Gu et al., 2017; Liu et al., 2018; Pang et al., 2020) or perturbing model parameters (Ji et al., 2018; Qi et al., 2022), and are thus subject to defenses based on model inspection (Wang et al., 2019; Liu et al., 2019) and data filtering (Gao et al., 2019).
In EVAS, however, as the backdoors are carried in the space of model arches, even if the victim trains the models using clean data and operates them in a black-box manner, the backdoors are still retained. Moreover, due to its independence of model parameters or training data, EVAS is naturally robust against defenses such as model inspection and input filtering. To realize EVAS, we define a novel metric based on neural tangent kernel (Chen et al., 2021) , which effectively indicates the exploitable vulnerability of a given arch; further, we integrate this metric into the NAS-without-training framework (Mellor et al., 2021; Chen et al., 2021) . The resulting search method is able to efficiently identify candidate arches without requiring model training or backdoor testing. To verify EVAS's empirical effectiveness, we evaluate EVAS on benchmark datasets and show: (i) EVAS successfully finds arches with exploitable vulnerability, (ii) the injected backdoors may be explained by arch-level "shortcuts" that recognize trigger patterns, and (iii) EVAS demonstrates high evasiveness, transferability, and robustness against defenses. Our findings show the feasibility of exploiting NAS as a new attack vector to implement previously improbable attacks, raise concerns about the current practice of NAS in security-sensitive domains, and point to potential directions to develop effective mitigation.

2. RELATED WORK

Next, we survey the literature relevant to this work. Neural arch search. The existing NAS methods can be categorized along search space, search strategy, and performance measure. Search space -early methods focus on the chain-of-layer structure (Baker et al., 2017) , while recent work proposes to search for motifs of cell structures (Zoph et al., 2018; Pham et al., 2018; Liu et al., 2019) . Search strategy -early methods rely on either random search (Jozefowicz et al., 2015) or Bayesian optimization (Bergstra et al., 2013) , which are limited in model complexity; recent work mainly uses the approaches of reinforcement learning (Baker et al., 2017) or neural evolution (Liu et al., 2019) . Performance measure -one-shot NAS has emerged as a popular performance measure. It considers all candidate arches as different sub-graphs of a super-net (i.e., the one-shot model) and shares weights between candidate arches (Liu et al., 2019) . Despite the intensive research on NAS, its security implications are largely unexplored. Recent work shows that NAS-generated models tend to be more vulnerable to various malicious attacks than manually designed ones (Pang et al., 2022; Devaguptapu et al., 2021) . This work explores another dimension: whether it can be exploited as an attack vector to launch new attacks, which complements the existing studies on the security of NAS. Backdoor attacks and defenses. Backdoor attacks inject malicious backdoors into the victim's model during training and activate such backdoors at inference, which can be categorized along attack targets -input-specific (Shafahi et al., 2018) , class-specific (Tang et al., 2020) , or any-input (Gu et al., 2017) , attack vectors -polluting training data (Liu et al., 2018) or releasing infected models (Ji et al., 2018) , and optimization metrics -attack effectiveness (Pang et al., 2020) , transferability (Yao et al., 2019) , or attack evasiveness (Chen et al., 2017) . 
To mitigate such threats, many defenses have also been proposed, which can be categorized according to their strategies (Pang et al., 2022): input filtering purges poisoning samples from training data (Tran et al., 2018); model inspection determines whether a given model is backdoored (Liu et al., 2019; Wang et al., 2019); and input inspection detects trigger inputs at inference time (Gao et al., 2019). Most attacks and defenses above focus on backdoors implemented in the space of model parameters. Concurrent to this work, Bober-Irizar et al. (2022) explore using neural arches to implement backdoors by manually designing "trigger detectors" in the arches and activating such detectors using poisoning data during training. This work investigates using NAS to directly search for arches with exploitable vulnerability, which represents a new direction of backdoor attacks.

3. EVAS ATTACK

Next, we present EVAS, a new backdoor attack leveraging NAS to find neural arches with exploitable vulnerability. We begin by introducing the threat model.

3.1. THREAT MODEL

A backdoor attack injects a hidden malicious function ("backdoor") into a target model (Pang et al., 2022). The backdoor is activated once a pre-defined condition ("trigger") is present, while the model behaves normally otherwise. In a predictive task, the backdoor is often defined as classifying a given input to a class desired by the adversary, while the trigger can be defined as a specific perturbation applied to the input.
Formally, given input x and trigger r = (m, p), in which m is a mask and p is a pattern, the trigger-embedded input x̃ is defined as:

x̃ = x ⊙ (1 − m) + p ⊙ m (1)

Let f be the backdoor-infected model. The backdoor attack implies that for a given input-label pair (x, y), f(x) = y and f(x̃) = t with high probability, where t is the adversary's target class. The conventional backdoor attacks typically follow two types of threat models: (i) the adversary directly trains a backdoor-embedded model, which is then released to and used by the victim user (Liu et al., 2018; Pang et al., 2020; Ji et al., 2018); or (ii) the adversary indirectly pollutes the training data or manipulates the training process (Gu et al., 2017; Qi et al., 2022) to inject the backdoor into the target model.
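As a concrete sketch of Eq. 1, the snippet below applies a patch trigger r = (m, p) to an input; the 4×4 image size, [0, 1] pixel range, and patch location are illustrative assumptions.

```python
# Sketch of Eq. 1: embedding a trigger r = (m, p) into an input x.
import numpy as np

def embed_trigger(x: np.ndarray, m: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Return the trigger-embedded input x~ = x * (1 - m) + p * m."""
    return x * (1.0 - m) + p * m

# Example: a 4x4 single-channel "image" with a 2x2 patch trigger.
x = np.zeros((4, 4))
m = np.zeros((4, 4)); m[:2, :2] = 1.0   # mask: top-left 2x2 patch
p = np.full((4, 4), 0.5)                # pattern value inside the patch
x_trig = embed_trigger(x, m, p)         # 0.5 inside the patch, x elsewhere
```

The mask m selects where the trigger is placed, and the pattern p determines what it looks like; outside the mask, x̃ is identical to x.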
As illustrated in Figure 1, in EVAS, we assume a more practical threat model in which the adversary only releases the exploitable arch to the user, who may choose to train the model from scratch using clean data or apply various defenses (e.g., model inspection or data filtering) before or during using the model. We believe this represents a more realistic setting: due to the prohibitive computational cost of NAS, users may opt to use performant model arches provided by third parties, which opens the door for the adversary to launch the EVAS attack. However, realizing EVAS presents non-trivial challenges, including (i) how to define the trigger patterns, (ii) how to define the exploitable, vulnerable arches, and (iii) how to search for such arches efficiently. Below we elaborate on each of these key questions.

3.2. INPUT-AWARE TRIGGERS

Most conventional backdoor attacks assume universal triggers: the same trigger is applied to all the inputs. However, universal triggers can be easily detected and mitigated by current defenses (Wang et al., 2019; Liu et al., 2019). Moreover, it is shown that implementing universal triggers at the arch level requires manually designing "trigger detectors" in the arches and activating such detectors using poisoning data during training (Bober-Irizar et al., 2022), which does not fit our threat model. Instead, as illustrated in Figure 1, we adopt input-aware triggers (Nguyen & Tran, 2020), in which a trigger generator g (parameterized by ϑ) generates a trigger r_x specific to each input x. Compared with universal triggers, input-aware triggers are more challenging to detect or mitigate. Interestingly, because of the modeling capacity of the trigger generator, it is also more feasible to implement input-aware triggers at the arch level (details in § 4). For simplicity, below we use x̃ = g(x; ϑ) to denote both generating the trigger r_x for x and applying r_x to x to produce the trigger-embedded input x̃.
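As a toy illustration of input-aware triggers, the sketch below makes both the mask and the pattern functions of the input, so each input receives its own trigger. The 4×4 input size and the random linear maps W_p and W_m are illustrative assumptions, not the paper's actual generator (whose configuration is deferred to Appendix § A).

```python
# Toy input-aware trigger generator: both the mask m(x) and the pattern
# p(x) depend on the input x, following Eq. 1 per input.
import numpy as np

rng = np.random.default_rng(0)
W_p = 0.1 * rng.normal(size=(16, 16))  # hypothetical pattern weights
W_m = 0.1 * rng.normal(size=(16, 16))  # hypothetical mask weights

def generate_trigger_input(x: np.ndarray) -> np.ndarray:
    """Return x~ = x * (1 - m(x)) + p(x) * m(x) with input-dependent m, p."""
    v = x.reshape(-1)
    p = np.tanh(W_p @ v).reshape(x.shape)                     # pattern in [-1, 1]
    m = (1.0 / (1.0 + np.exp(-(W_m @ v)))).reshape(x.shape)   # soft mask in [0, 1]
    return x * (1.0 - m) + p * m

x1, x2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
t1 = generate_trigger_input(x1) - x1   # perturbation applied to x1
t2 = generate_trigger_input(x2) - x2   # perturbation applied to x2
```

Because t1 and t2 differ across inputs, there is no single universal perturbation for a defense to reverse-engineer, which is the property exploited in the rest of the section.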

3.3. EXPLOITABLE ARCHES

In EVAS, we aim to find arches with backdoors exploitable by the trigger generator, which we define as the following optimization problem. Specifically, let α and θ respectively denote f's arch and model parameters. We define f's training as minimizing the following loss:

L_trn(θ, α) ≜ E_(x,y)∼D ℓ(f_α(x; θ), y) (2)

where f_α denotes the model with arch fixed as α and D is the underlying data distribution. As θ is dependent on α, we define:

θ_α ≜ arg min_θ L_trn(θ, α) (3)

Further, we define the backdoor attack objective as:

L_atk(α, ϑ) ≜ E_(x,y)∼D [ℓ(f_α(x; θ_α), y) + λ ℓ(f_α(g(x; ϑ); θ_α), t)] (4)

where the first term specifies that f works normally on clean data, the second term specifies that f classifies trigger-embedded inputs to target class t, and the parameter λ balances the two factors. Note that we assume the testing data follows the same distribution D as the training data. Overall, we consider an arch α* having exploitable vulnerability if it is possible to find a trigger generator ϑ*, such that L_atk(α*, ϑ*) is below a certain threshold.
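A minimal sketch of evaluating the empirical attack objective L_atk (Eq. 4) on a finite sample, assuming cross-entropy for ℓ; the `model` and `generator` callables are hypothetical stand-ins for f_α(·; θ_α) and g(·; ϑ).

```python
# Empirical L_atk: clean-classification term + lambda * backdoor term.
import numpy as np

def cross_entropy(probs: np.ndarray, label: int) -> float:
    """l(probs, label) with a small epsilon for numerical stability."""
    return -float(np.log(probs[label] + 1e-12))

def attack_loss(model, generator, data, target: int, lam: float = 1.0) -> float:
    """Sample average of Eq. 4 over (x, y) pairs in `data`."""
    clean = np.mean([cross_entropy(model(x), y) for x, y in data])
    backdoor = np.mean([cross_entropy(model(generator(x)), target)
                        for x, _ in data])
    return float(clean + lam * backdoor)
```

For instance, with a dummy model that always outputs probabilities [0.7, 0.3], an identity generator, and target class 1, the loss is −log 0.7 − log 0.3.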

3.4. SEARCH WITHOUT TRAINING

Searching for exploitable arches by directly optimizing Eq. 4 is challenging: the nested optimization requires re-computing θ (i.e., re-training model f) in L_trn whenever α is updated; further, as α and ϑ are coupled in L_atk, it requires re-training generator g once α is changed. Motivated by recent work (Mellor et al., 2021; Wu et al., 2021; Abdelfattah et al., 2021; Ning et al., 2021) on NAS using easy-to-compute metrics as proxies (without training), we present a novel method of searching for exploitable arches based on the neural tangent kernel (NTK) (Jacot et al., 2018) without training the target model or trigger generator. Intuitively, the NTK describes model training dynamics under gradient descent (Jacot et al., 2018; Chizat et al., 2019; Lee et al., 2019). In the limit of infinite-width DNNs, the NTK becomes constant, which allows closed-form statements to be made about model training. Recent work (Chen et al., 2021; Mok et al., 2022) shows that the NTK serves as an effective predictor of model "trainability" (i.e., how fast the model converges at early training stages). Formally, considering model f (parameterized by θ) mapping input x to a probability vector f(x; θ) (over different classes), the NTK is defined as the product of the Jacobian matrices:

Θ(x, θ) ≜ (∂f(x; θ)/∂θ) (∂f(x; θ)/∂θ)^⊺ (5)

Let λ_min (λ_max) be the smallest (largest) eigenvalue of the empirical NTK Θ(θ) ≜ E_(x,y)∼D Θ(x, θ). The condition number κ ≜ λ_max/λ_min serves as a metric to estimate model trainability (Chen et al., 2021), with a smaller condition number indicating higher trainability. In our context, we consider the trigger generator and the target model as an end-to-end model and measure the empirical NTK of the trigger generator under randomly initialized θ:

Θ(ϑ) ≜ E_(x,y)∼D, θ∼P_θ0 [(∂f(g(x; ϑ); θ)/∂ϑ) (∂f(g(x; ϑ); θ)/∂ϑ)^⊺] (6)

where P_θ0 represents the initialization distribution of θ. Here, we emphasize that the measure should be independent of θ's initialization.
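The condition-number proxy can be sketched as follows: given the Jacobian of the model outputs with respect to the relevant parameters, form Θ = J J^⊺ and take κ = λ_max/λ_min. The toy random Jacobian below stands in for ∂f(g(x; ϑ); θ)/∂ϑ, which would be obtained via automatic differentiation in practice.

```python
# Condition number of the empirical NTK from a Jacobian matrix.
import numpy as np

def ntk_condition_number(jacobian: np.ndarray) -> float:
    """jacobian: (n_outputs, n_params) matrix of partial derivatives."""
    theta_mat = jacobian @ jacobian.T          # empirical NTK, (n_out, n_out)
    eigvals = np.linalg.eigvalsh(theta_mat)    # symmetric PSD -> real, ascending
    return float(eigvals[-1] / eigvals[0])     # kappa = lambda_max / lambda_min

rng = np.random.default_rng(0)
J = rng.normal(size=(5, 20))        # Jacobian of a hypothetical 5-output model
kappa = ntk_condition_number(J)     # smaller kappa ~ higher trainability
```

Since Θ is symmetric positive semi-definite, `eigvalsh` returns real eigenvalues in ascending order, so the first and last entries give λ_min and λ_max directly.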
Intuitively, Θ(ϑ) measures the trigger generator's trainability with respect to a randomly initialized target model. The generator's trainability indicates the easiness of effectively generating input-aware triggers, implying the model's vulnerability to input-aware backdoor attacks. To verify this hypothesis, on the CIFAR10 dataset with the generator configured as in Appendix § A, we measure Θ(ϑ) with respect to 900 randomly generated arches, as well as the model accuracy (ACC) on clean inputs and the attack success rate (ASR) on trigger-embedded inputs. Specifically, for each arch α, we first train the model f_α to measure ACC and then train the trigger generator g with respect to f_α on the same dataset to measure ASR, with results shown in Figure 2. Observe that the condition number of Θ(ϑ) has a strong negative correlation with ASR, with a smaller value indicating higher attack vulnerability; meanwhile, it has a limited correlation with ACC, with most of the arches having ACC within the range from 80% to 95%.

Leveraging the insights above, we present a simple yet effective algorithm that searches for exploitable arches without training, which is a variant of regularized evolution (Real et al., 2019; Mellor et al., 2021). As sketched in Algorithm 1, it starts from a candidate pool A of n arches randomly sampled from a pre-defined arch space; at each iteration, it samples a subset A′ of m arches from A, randomly mutates the best candidate (i.e., the one with the lowest score), and replaces the oldest arch in A with this newly mutated arch. In our implementation, the score function is defined as the condition number of Eq. 6; the arch space is defined to be the NATS-Bench search space (Dong & Yang, 2020), which consists of 5 atomic operators {none, skip connect, conv 1×1, conv 3×3, and avg pooling 3×3}; and the mutation function is defined as randomly substituting one operator with another.

Algorithm 1: EVAS Attack
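A minimal sketch of the regularized-evolution loop described above, under the assumption that arches are encoded as lists of NATS-Bench operators and `score` is a black-box stand-in for the condition number of Eq. 6 (lower is better).

```python
# Regularized-evolution variant: maintain a pool of n arches, sample m,
# mutate the best-scoring one, and retire the oldest pool member.
import random

OPS = ["none", "skip_connect", "conv_1x1", "conv_3x3", "avg_pool_3x3"]

def mutate(arch):
    """Randomly substitute one operator with a different one."""
    child = list(arch)
    i = random.randrange(len(child))
    child[i] = random.choice([op for op in OPS if op != child[i]])
    return child

def evolve(score, n=20, m=5, iters=100, n_edges=6, seed=0):
    """Return the best (lowest-score) arch found after `iters` steps."""
    random.seed(seed)
    pool = [[random.choice(OPS) for _ in range(n_edges)] for _ in range(n)]
    for _ in range(iters):
        subset = random.sample(pool, m)      # sample m candidates
        best = min(subset, key=score)        # lowest score = most exploitable
        pool.pop(0)                          # retire the oldest arch
        pool.append(mutate(best))            # insert the mutated child
    return min(pool, key=score)
```

The 6-edge encoding matches the NATS-Bench cell (operators among 4 nodes); retiring by age rather than by fitness is what makes the scheme "regularized" evolution.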

4. EVALUATION

We conduct an empirical evaluation of EVAS on benchmark datasets under various scenarios. The experiments are designed to answer the following key questions: (i) does it work? -we evaluate the performance and vulnerability of the arches identified by EVAS; (ii) how does it work? -we explore the dynamics of EVAS search as well as the characteristics of its identified arches; and (iii) how does it differ? -we compare EVAS with conventional backdoors in terms of attack evasiveness, transferability, and robustness.

4.1. EXPERIMENTAL SETTING

Datasets. In the evaluation, we primarily use three datasets that have been widely used to benchmark NAS methods (Chen et al., 2019; Li et al., 2020; Liu et al., 2019; Pham et al., 2018; Xie et al., 2019): CIFAR10 (Krizhevsky & Hinton, 2009), which consists of 32×32 color images drawn from 10 classes; CIFAR100, which is similar to CIFAR10 but includes 100 finer-grained classes; and ImageNet16, which is a subset of the ImageNet dataset (Deng et al., 2009) down-sampled to images of size 16×16 in 120 classes.

(Figure 3: the cell arches of (i) EVAS, (ii) Random 1, and (iii) Random 2, composed of the conv 1×1, conv 3×3, avg pool 3×3, and skip connect operators over nodes 0-3.)

Search space. We consider the search space defined by NATS-Bench (Dong et al., 2021), which consists of 5 operators {none, skip connect, conv 1×1, conv 3×3, and avg pooling 3×3} defined among 4 nodes, implying a search space of 15,625 candidate arches.

Baselines. We compare the arches found by EVAS with ResNet18 (He et al., 2016), a manually designed arch. For completeness, we also include two arches randomly sampled from the NATS-Bench space, which are illustrated in Figure 3. By default, for each arch α, we assume the adversary trains a model f_α and then trains the trigger generator g with respect to f_α on the same dataset. We consider varying settings in which the victim directly uses f_α, fine-tunes f_α, or only uses α and re-trains it from scratch (details in § 4.4).

Metrics. We mainly use two metrics, attack success rate (ASR) and clean data accuracy (ACC). Intuitively, ASR is the target model's accuracy in classifying trigger inputs to the adversary's target class during inference, which measures the attack effectiveness, while ACC is the target model's accuracy in correctly classifying clean inputs, which measures the attack evasiveness. The default parameter setting and the trigger generator configuration are deferred to Appendix § A.
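The two metrics can be sketched as follows; `model` (returning a predicted class id) and `generator` (returning a trigger-embedded input) are hypothetical callables.

```python
# ACC on clean inputs and ASR on trigger-embedded inputs.
import numpy as np

def accuracy(model, data) -> float:
    """ACC: fraction of clean inputs classified to their true label."""
    return float(np.mean([model(x) == y for x, y in data]))

def attack_success_rate(model, generator, data, target: int) -> float:
    """ASR: fraction of trigger-embedded inputs classified to the target."""
    return float(np.mean([model(generator(x)) == target for x, _ in data]))
```

A backdoored model should score high on both: high ACC keeps the attack evasive, while high ASR makes it effective.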

4.2. Q1: DOES EVAS WORK?

Figure 3 illustrates one sample arch identified by EVAS on the CIFAR10 dataset. We use this arch throughout this set of experiments to show that its vulnerability is at the arch level and universal across datasets. To measure the vulnerability of different arches, we first train each arch using clean data, then train a trigger generator specific to this arch, and finally measure its ASR and ACC. Table 1 reports the results. We have the following observations. First, the ASR of EVAS is significantly higher than ResNet18 and the other two random arches. For instance, on CIFAR10, EVAS is 21.8%, 28.3%, and 34.5% more effective than ResNet18 and the two random arches, respectively. Second, EVAS has the highest ASR across all the datasets. Recall that we use the same arch throughout different datasets. This indicates that the attack vulnerability probably resides at the arch level and is insensitive to concrete datasets, which corroborates prior work on NAS: a performant arch found on one dataset often transfers across different datasets (Liu et al., 2019). This may be explained as follows. An arch α essentially defines a function family F_α, while a trained model f_α(·; θ) is an instance in F_α, thereby carrying the characteristics of F_α (e.g., effective to extract important features or exploitable by a trigger generator). Third, all the arches show higher ASR on simpler datasets such as CIFAR10. This may be explained by the fact that more complex datasets (e.g., more classes, higher resolution) imply more intricate manifold structures, which may interfere with arch-level backdoors. To understand the attack effectiveness of EVAS on individual inputs, we illustrate sample clean inputs and their trigger-embedded variants in Figure 4. Further, using GradCam (Selvaraju et al., 2017), we show the model's interpretation of clean and trigger inputs with respect to their original and target classes. Observe that the trigger pattern is specific to each input.
Further, even though the two trigger inputs are classified into the same target class, the difference in their heatmaps shows that the model pays attention to distinct features, highlighting the effects of input-aware triggers.

4.3. Q2: HOW DOES EVAS WORK?

Next, we explore the dynamics of how EVAS searches for exploitable arches. For simplicity, given the arch identified by EVAS in Figure 3, we consider the set of candidate arches with the operators on the 0-3 (skip connect) and 0-1 (conv 3×3) connections replaced by others. We measure the ACC and ASR of all these candidate arches and illustrate the landscape of their scores in Figure 5. Observe that the exploitable arch features the lowest score among the surrounding arches, suggesting the existence of feasible mutation paths from random arches to exploitable arches following the direction of score descent. Further, we ask the question: what makes the arches found by EVAS exploitable? Observe that the arch in Figure 3 uses the conv 1×1 and 3×3 operators on a number of connections. We thus generate arches by enumerating all the possible combinations of conv 1×1 and 3×3 on these connections and measure their performance, with results summarized in Appendix § B. Observe that while all these arches show high ASR, their vulnerability varies greatly from about 50% to 90%. We hypothesize that specific combinations of conv 1×1 and conv 3×3 create arch-level "shortcuts" for recognizing trigger patterns. We consider exploring the causal relationships between concrete arch characteristics and attack vulnerability as our ongoing work.

4.4. Q3: HOW DOES EVAS DIFFER?

To further understand the difference between EVAS and conventional backdoors, we compare the arches found by EVAS and other arches under various training and defense scenarios.

Fine-tuning with clean data. We first consider the scenario in which, with the trigger generator fixed, the target model is fine-tuned using clean data (with concrete settings deferred to Appendix § A). Table 2 shows the results evaluated on CIFAR10 and CIFAR100. Observe that fine-tuning has a marginal impact on the ASR of all the arches. Take Random 1 as an example: compared with Table 1, its ASR on CIFAR10 drops only by 7.40% after fine-tuning. This suggests that the effectiveness of fine-tuning to defend against input-aware backdoor attacks may be limited.

Re-training from scratch. Another common scenario is that the victim user re-initializes the target model and re-trains it from scratch using clean data. We simulate this scenario as follows. After the trigger generator and target model are trained, we fix the generator, randomly initialize (using different seeds) the model, and train it on the given dataset. Table 3 compares different arches under this scenario. It is observed that EVAS significantly outperforms ResNet18 and random arches in terms of ASR (with comparable ACC). For instance, it is 33.4%, 24.9%, and 19.6% more effective than the other arches respectively. This may be explained by two reasons. First, the arch-level backdoors in EVAS are inherently more agnostic to model re-training than the model-level backdoors in other arches. Second, in searching for exploitable arches, EVAS explicitly enforces that such vulnerability should be insensitive to model initialization (cf. Eq. 4). Further, observe that, as expected, re-training has a larger impact than fine-tuning on the ASR of different arches; however, it is still insufficient to mitigate input-aware backdoor attacks.

Fine-tuning with poisoning data.
Further, we explore the setting in which the adversary is able to poison a tiny portion of the fine-tuning data, which assumes a stronger threat model. To simulate this scenario, we apply the trigger generator to generate trigger-embedded inputs and mix them with the clean fine-tuning data. Figure 6 illustrates the ASR and ACC of the target model as functions of the fraction of poisoning data in the fine-tuning dataset. Observe that even an extremely small poisoning ratio (e.g., 0.01%) can significantly boost the ASR (e.g., to 100%) while keeping the ACC unaffected. This indicates that arch-level backdoors can be greatly enhanced by combining with other attack vectors (e.g., data poisoning).

Backdoor defenses. Finally, we evaluate EVAS against three categories of defenses: model inspection, input filtering, and model sanitization. Model inspection determines whether a given model f is infected with backdoors. We use NeuralCleanse (Wang et al., 2019) as a representative defense. Intuitively, it searches for potential triggers in each class. If a class is trigger-embedded, the minimum perturbation required to change the predictions of inputs from other classes to this class is abnormally small. It detects anomalies using median absolute deviation (MAD), and all classes with MAD scores larger than 2 are regarded as infected. As shown in Table 4, the MAD scores of EVAS's target classes on the three datasets are all below this threshold. This can be explained by the fact that NeuralCleanse is built upon the universal trigger assumption, which does not hold for EVAS. Input filtering detects at inference time whether an incoming input is embedded with a trigger. We use STRIP (Gao et al., 2019) as a representative defense in this category. It mixes a given input with a clean input and measures the self-entropy of its prediction. If the input is trigger-embedded, the mixture remains dominated by the trigger and tends to be misclassified, resulting in low self-entropy.
However, as shown in Table 4, the AUROC scores of STRIP in classifying trigger-embedded inputs by EVAS are all close to random guessing (i.e., 0.5). This can also be explained by the fact that EVAS uses input-aware triggers, where each trigger only works for one specific input and has a limited impact on others.

Model sanitization, before using a given model, sanitizes it to mitigate potential backdoors, yet without explicitly detecting whether the model is tampered with. We use Fine-Pruning (Liu et al., 2018) as a representative. It uses the property that backdoor attacks typically exploit spare model capacity. It thus prunes rarely used neurons and then applies fine-tuning to defend against pruning-aware attacks. We apply Fine-Pruning on the EVAS and ResNet18 models from Table 1, with results shown in Table 5. Observe that Fine-Pruning has a limited impact on the ASR of EVAS (even less than on ResNet18). This may be explained as follows. The activation patterns of input-aware triggers are different from those of universal triggers, as each trigger may activate a different set of neurons. Moreover, the arch-level backdoors in EVAS may not concentrate on individual neurons but span the whole model structure.
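A minimal sketch of the STRIP-style check described above, assuming `model` returns a probability vector: a suspect input is blended with clean inputs and the mean self-entropy of the predictions is computed (persistently low entropy suggests a universal trigger); the blending weight `alpha` is an illustrative assumption.

```python
# STRIP-style self-entropy score for a suspect input.
import numpy as np

def entropy(probs: np.ndarray) -> float:
    """Self-entropy of a probability vector (epsilon for stability)."""
    return float(-np.sum(probs * np.log(probs + 1e-12)))

def strip_score(model, x, clean_inputs, alpha: float = 0.5) -> float:
    """Mean self-entropy of predictions on blends of x with clean inputs."""
    return float(np.mean(
        [entropy(model(alpha * x + (1 - alpha) * c)) for c in clean_inputs]))
```

An input carrying a universal trigger keeps dominating the blended predictions and yields a low score, whereas input-aware triggers (as in EVAS) break under blending, so their scores resemble those of clean inputs.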

5. CONCLUSION

This work studies the feasibility of exploiting NAS as an attack vector to launch previously improbable attacks. We present a new backdoor attack that leverages NAS to efficiently find neural network architectures with inherent, exploitable vulnerability. Such architecture-level backdoors demonstrate many interesting properties, including evasiveness, transferability, and robustness, thereby greatly expanding the design spectrum for the adversary. We believe our findings raise concerns about the current practice of NAS in security-sensitive domains and point to potential directions for developing effective mitigation.

B.1 NTK CONDITION NUMBER OF THE TARGET MODEL

Here, we measure the NTK condition number of the target model f under random initialization using the implementation of (Chen et al., 2021), together with its corresponding ASR and ACC. Figure 7 shows their correlation. Observe that the NTK condition number is negatively correlated with ACC (with Kendall's coefficient τ = -0.385) and has a very weak correlation with ASR (with τ = 0.100), which is consistent with (Chen et al., 2021). The difference between Figure 2 and Figure 7 can be explained as follows. Figure 2 measures the NTK condition number κ_g of the trigger generator g (with respect to the randomly initialized target model f), which indicates g's trainability (i.e., f's vulnerability). As backdoor attacks embed two functions (one classifying clean inputs and the other classifying trigger inputs) into the same model, there tends to exist a natural trade-off between ASR and ACC. Therefore, κ_g shows a negative correlation with ASR and a weak positive correlation with ACC. Meanwhile, Figure 7 measures the NTK condition number κ_f of the target model f, which indicates f's trainability. Therefore, κ_f shows a negative correlation with ACC but only a very weak correlation with ASR.
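For concreteness, the NTK condition number can be estimated from the empirical NTK Θ = J Jᵀ, where J stacks the Jacobians of the network outputs with respect to the parameters over a batch of inputs. A minimal numpy sketch; the linear toy model standing in for f is purely illustrative (the paper uses the implementation of Chen et al., 2021):

```python
import numpy as np

def empirical_ntk_condition_number(jacobian):
    """kappa = lambda_max / lambda_min of the empirical NTK Theta = J @ J.T."""
    ntk = jacobian @ jacobian.T
    eig = np.linalg.eigvalsh(ntk)  # eigenvalues in ascending order
    return float(eig[-1] / eig[0])

# Toy stand-in: for a linear model f(x; theta) = theta . x, the Jacobian row
# for input x_i is x_i itself, so J is simply the batch of inputs.
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 32))  # 8 inputs, 32 parameters
kappa = empirical_ntk_condition_number(J)
```

A small κ indicates a well-conditioned NTK and hence good trainability; a large κ indicates the opposite, which is the quantity correlated against ACC and ASR above.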

B.2 ASR-ACC TRADE-OFF

Figure 8 shows the correlation between the ASR and ACC of sampled arches (with Kendall's coefficient τ = -0.390). Intuitively, as backdoor attacks embed two functions (one classifying clean inputs and the other classifying trigger inputs) into the same model, there tends to exist a natural trade-off between ASR and ACC. This trade-off also implies that it is feasible to optimize for ASR alone while still finding performant yet vulnerable arches.
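The Kendall's τ coefficients reported here can be computed with a simple pair-counting sketch (τ-a variant, ignoring ties; library routines such as scipy.stats.kendalltau handle ties more carefully). The (ACC, ASR) values below are hypothetical, chosen only to show a strict trade-off:

```python
def kendall_tau(xs, ys):
    """Kendall's rank correlation (tau-a: no tie correction)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical (ACC, ASR) pairs with a strict trade-off: every pair is
# discordant, so tau = -1.0.
acc = [0.90, 0.91, 0.93, 0.95]
asr = [0.99, 0.97, 0.92, 0.80]
tau = kendall_tau(acc, asr)
```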

B.3 INTERPRETABILITY VERSUS VULNERABILITY

To understand the possible correlation between the attack vulnerability of an arch α and its interpretability, we compare the interpretation of each model f_α regarding 100 clean inputs using GradCam (Selvaraju et al., 2017). Figure 9 illustrates sample inputs and their interpretation by different models. Further, to quantitatively measure the similarity of interpretation, we use the intersection-over-union (IoU) score, which is widely used in object detection to compare model predictions with ground-truth bounding boxes. Formally, the IoU score of a binary-valued heatmap m with respect to another heatmap m′ is defined as their Jaccard similarity: IoU(m, m′) = |O(m) ∩ O(m′)| / |O(m) ∪ O(m′)|, where O(m) denotes the set of non-zero elements in m. In our case, as the values of heatmaps are floating-point numbers, we first apply thresholding to binarize them. Figure 10 shows the average IoU score of each arch with respect to the others. Observe that (i) the arches generated by NAS (EVAS, Random I, and Random II) have more similar interpretability among themselves than with manually designed arches.
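The IoU computation above can be sketched as follows; the 0.5 binarization threshold and the toy heatmaps are assumptions for illustration:

```python
import numpy as np

def iou(m, m_prime, threshold=0.5):
    """Jaccard similarity of the supra-threshold supports of two heatmaps."""
    a = np.asarray(m) >= threshold       # binarize floating-point heatmaps
    b = np.asarray(m_prime) >= threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as identical
    return float(np.logical_and(a, b).sum() / union)

# Two toy 2x2 heatmaps: their binary masks overlap in one of the three
# highlighted cells, so IoU = 1/3.
m = np.array([[0.9, 0.1], [0.6, 0.2]])
m_prime = np.array([[0.8, 0.7], [0.1, 0.0]])
score = iou(m, m_prime)
```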



In the following, we use "arch" as short for "architecture".




Figure 1: Attack framework of EVAS. (1) The adversary applies NAS to search for arches with exploitable vulnerability; (2) such vulnerability is retained even if the models are trained using clean data; (3) the adversary exploits such vulnerability by generating trigger-embedded inputs.

Figure 2: The condition number of the NTK versus the model performance (ACC) and vulnerability (ASR).

Figure 3: Sample arch identified by EVAS in comparison with two randomly generated arches.

Figure 4: Sample clean and trigger-embedded inputs as well as their GradCam interpretation by the target model.

Figure 5: Landscape of candidate arches surrounding exploitable arches with their ASR, ACC, and scores.

Figure 6: Model performance on clean inputs (ACC) and attack performance on trigger-embedded inputs (ASR) of EVAS as a function of poisoning ratio.

(Fragment of the generator network architecture: blocks of ConvBNReLU 3×3 layers with channel widths 128 → 64 → 32, each block preceded by an Upsample layer (scale_factor = 2), ending in a ConvBN 3×3 layer that outputs the trigger mask.)

Figure 7: The NTK condition number of the target model versus the model performance (ACC) and vulnerability (ASR).

Figure 8: Trade-off between model performance (ACC) and vulnerability (ASR).

Table 1: Model performance on clean inputs (ACC) and attack performance on trigger-embedded inputs (ASR) of EVAS, ResNet18, and two random arches.

Table 2: Model performance on clean inputs (ACC) and attack performance on trigger-embedded inputs (ASR) of EVAS, ResNet18, and two random arches after fine-tuning.

Table 3: Model performance on clean inputs (ACC) and attack performance on trigger-embedded inputs (ASR) of EVAS, ResNet18, and two random arches after re-training from scratch.

Table 4: Detection results of NeuralCleanse and STRIP for EVAS. NeuralCleanse shows the MAD score and STRIP shows the AUROC score of binary classification.

Table 5: Model performance on clean inputs (ACC) and attack performance on trigger-embedded inputs (ASR) of EVAS and ResNet18 after Fine-Pruning.

Table 6: Default parameter setting.

Table 7: Generator network architecture.

ACKNOWLEDGMENTS

We thank the anonymous reviewers and our shepherd for their valuable feedback. This work is partially supported by the National Science Foundation under Grant Nos. 2212323, 2119331, 1951729, and 1953893. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. S. Ji is partly supported by the National Key Research and Development Program of China under No. 2022YFB3102100, and by NSFC under Nos. 62102360 and U1936215.

A EXPERIMENTAL SETTING

A.1 PARAMETER SETTING 

B.4 ABLATION OF ATTACK EVASIVENESS

The evasiveness of EVAS may be attributed to (i) input-dependent triggers (Nguyen & Tran, 2020) and (ii) arch-level vulnerability. Here, we explore the contribution of input-dependent triggers to the attack evasiveness. We train the trigger generator with respect to different arches (EVAS, ResNet18, and random arches) and run NeuralCleanse and STRIP to detect the attacks, with results summarized in Table 8. Observe that, while the concrete measures vary, all the attacks have MAD scores below the threshold and AUROC scores close to random guessing, indicating that the input-dependent triggers mainly account for the attack evasiveness with respect to NeuralCleanse and STRIP.

We generate neighboring arches by enumerating all possible combinations of conv 1×1 and conv 3×3 on the connections of the arch identified by EVAS ("|{0}~0|+|{1}~0|{2}~1|+|skip_connect~0|{3}~1|{4}~2|"). The ASR and ACC of these arches are summarized in Table 9.
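The neighborhood enumeration can be sketched as follows; the genotype syntax follows the string quoted above, and the op names conv_1x1 / conv_3x3 are assumed labels for the two candidate operations:

```python
from itertools import product

# Genotype of the arch found by EVAS, with the five searched connections
# left as numbered slots {0}..{4} (syntax follows the string in the text).
TEMPLATE = "|{0}~0|+|{1}~0|{2}~1|+|skip_connect~0|{3}~1|{4}~2|"

def neighboring_arches(template=TEMPLATE, ops=("conv_1x1", "conv_3x3")):
    """Enumerate every assignment of the candidate ops to the open slots."""
    n_slots = 5  # slots {0}..{4} in the template
    return [template.format(*combo) for combo in product(ops, repeat=n_slots)]

arches = neighboring_arches()  # 2^5 = 32 neighboring arches
```

Each resulting genotype string can then be instantiated, trained on clean data, and evaluated for ASR and ACC to map out the landscape around the exploitable arch.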

