BACKDOORS STUCK AT THE FRONTDOOR: MULTI-AGENT BACKDOOR ATTACKS THAT BACKFIRE

Abstract

Malicious agents in collaborative learning and outsourced data collection threaten the training of clean models. Backdoor attacks, where an attacker poisons a model during training to achieve targeted misclassification at test-time, are a major concern for train-time robustness. In this paper, we investigate a multi-agent backdoor attack scenario, where multiple attackers attempt to backdoor a victim model simultaneously. A consistent backfiring phenomenon is observed across a wide range of games, where agents suffer from a low collective attack success rate. We examine different backdoor attack configurations, non-cooperation/cooperation, joint distribution shifts, and game setups, and find that the equilibrium attack success rate sits at the lower bound. The results motivate the re-evaluation of backdoor defense research for practical environments.

1. INTRODUCTION

Beyond training algorithms, the scale-up of model training depends strongly on the trust between agents. In collaborative learning and outsourced data collection training regimes, backdoor attacks and defenses (Gao et al., 2020; Li et al., 2021) are studied to mitigate a single malicious agent that perturbs train-time images for targeted test-time misclassifications. Outsourced data collection is common amongst industry practitioners, where Taigman et al. (2014); Papernot (2018) find a strong reliance on web-scraped data or third-party sourcing. Kumar et al. (2020) also find that data poisoning is viewed by practitioners as their most serious threat. It is plausible in practice for more than one attacker to be present, such as in the poisoning of crowdsourced datasets on Google Images (hence afflicting subsequent scraped datasets) and of agent-driven financial market data, or poisoning through human-in-the-loop learning on mobile devices or social network platforms. In this paper, we investigate the under-represented aspect of agent dynamics in backdoor attacks: what happens when multiple backdoor attackers are present? We simulate different configurations to study how the payoff landscape changes for attackers, with respect to standard attack/defense configurations, cooperative/non-cooperative behaviour, and joint distribution shifts. Our key contributions are:

• We explore the novel scenario of the multi-agent backdoor attack. Our findings on the backfiring effect and a low equilibrium attack success rate indicate a stable, natural defense against backdoor attacks, and motivate us to propose the multi-agent setting as a baseline in future research.
• We introduce a set of cooperative dynamics between multiple attackers, extending existing backdoor attack procedures with respect to trigger pattern generation or trigger label selection.
• We vary the sources of distribution shift, from just multiple backdoor perturbations to the inclusion of adversarial and stylized perturbations, to investigate changes to a wider scope of attack success.

2. RELATED WORK

Backdoor Attacks. We refer the reader to Gao et al. (2020); Li et al. (2021) for detailed backdoor literature. In poisoning attacks (Alfeld et al., 2016; Biggio et al., 2012; Jagielski et al., 2021; Koh & Liang, 2017; Xiao et al., 2015), the attack objective is to reduce the accuracy of a model on clean samples. In backdoor attacks (Gu et al., 2019a), the attack objective is to maximize the attack success rate in the presence of the trigger while retaining the accuracy of the model on clean samples. To achieve this attack objective, there are different variants of attack vectors, such as code poisoning (Bagdasaryan & Shmatikov, 2021; Xiao et al., 2018), pre-trained model tampering (Yao et al., 2019; Ji et al., 2018; Rakin et al., 2020), or outsourced data collection (Gu et al., 2019a; Chen et al., 2017; Shafahi et al., 2018b; Zhu et al., 2019b; Saha et al., 2020; Lovisotto et al., 2020; Datta & Shadbolt, 2022b). We specifically evaluate backdoor attacks manifesting through outsourced data collection. Though the attack vectors and corresponding attack methods vary, the principle is consistent: model weights are modified such that they achieve the backdoor attack objective.

Multi-Agent Attacks. Backdoor attacks (Suresh et al., 2019; Wang et al., 2020; Bagdasaryan et al., 2020; Huang, 2020) and poisoning attacks (Hayes & Ohrimenko, 2018; Mahloujifar et al., 2018; 2019; Chen et al., 2021; Fang et al., 2020) against federated learning systems and against multi-party learning models have been demonstrated, but with a single attacker intending to compromise multiple victims (i.e. single attacker vs multiple defenders); for example, with a single attacker controlling multiple participant nodes in the federated learning setup (Bagdasaryan et al., 2020), or decomposing a backdoor trigger pattern into multiple distributed small patterns to be injected by multiple participant nodes controlled by a single attacker (Xie et al., 2020).
Our multi-agent backdoor attack could be evaluated extensibly in federated learning, where multiple attackers control distinctly different nodes to backdoor the joint model. Though not multi-agent attacks, Xue et al. (2020); Nguyen & Tran (2020); Salem et al. (2021) make use of multiple trigger patterns in their single-agent backdoor attacks. Xue et al. (2020) proposed a 1-to-N attack, where an attacker triggers multiple backdoor inputs by varying the intensity of the same backdoor, and an N-to-1 attack, where the backdoor attack is triggered only when all N backdoor (sub-)triggers are present. Though their implementation of multiple triggers is for the purpose of maximizing a single-agent payoff, we reference their insights in evaluating a low-distance-triggers, cooperative attack in (E4). Our work is unique because: (i) prior work evaluates a single attacker against multiple victims, while our work evaluates multiple attackers against each other and a defender; (ii) our attack objective is strict and individualized for each attacker (i.e. in a poisoning attack, each attacker can have a generalized, attacker-agnostic objective of reducing the standard model accuracy, but in a backdoor attack, each attacker has an individualized objective with respect to their own trigger patterns and target labels). Our work is amongst the first to investigate this conflict between the attack objectives of multiple attackers, hence the resulting backfiring effect does not manifest in existing multi-agent attack work.

3. MULTI-AGENT BACKDOOR ATTACK

3.1. GAME DESIGN

The scope of our analysis is that the multi-agent backdoor attack is a single-turn game, composed of N attackers and M defenders. The game environment is a joint dataset D that agents contribute private datasets D_i (attacker train-time set) towards (Figure 1). After private dataset contributions are complete and D is set, payoffs are computed with respect to test-time inputs (attacker run-time set) evaluated on a model trained by the defender on D (defender train set & validation set). Section 3.1 defines agent dynamics. Section 3.2 informs us how the relative distance between backdoor trigger patterns and trigger selection induces the backfire effect, and introduces the analysis of the insertion of subnetwork gradients. Appendix A.1 provides supplementary preliminaries and proofs for this section.

Let 𝒳 ∈ ℝ^(l×w×c) and 𝒴 = {1, 2, ..., k} be the corresponding input and output spaces. {D_i}^N and D \ {D_i} ∼ 𝒳 × 𝒴 are sources of shifted 𝒳:𝒴 distributions from which an observation x can be sampled. x can be decomposed as x = x̃ + ε, where x̃ is the set of clean features in x, and ε : {ε ≥ 0}^(N+1) is the set of perturbations that can exist.

Attacker's Parameters: An attacker a_i would like to maximize their payoff (Eqt 1), the attack success rate (ASR), which is the rate of misclassification of backdoored inputs X^poison from the clean label Y_i^clean to the target poisoned label Y_i^poison by the defender's model f. The attacker prefers to keep the poison rate p_i low to generate imperceptible and stealthy perturbations. The attacker strategy, formulated by its actions, is denoted as (ε_i, p_i, Y_i^poison, b_i). The predicted output would be Ŷ = f(X_i; (θ, D); (r_j, s_j); (ε_i, p_i, Y_i^poison, b_i)). We compute the accuracy of the predicted outputs in test-time against the target poisoned labels as the payoff π = Acc(Ŷ, Y^poison). Each attacker optimizes their actions against the collective set of actions of the other ¬i attackers.
Defender's Parameters: Each defender is a player {d_j}_(j∈M) that trains a model f on the joint dataset D, which may contain backdoored inputs, until it obtains model parameters θ. In our analysis, there is one defender only (M = 1). In terms of information, the defender can view and access the joint dataset and contributions D, but is not given information on attacker actions (e.g. which inputs are poisoned). To formulate the defender's strategies {(r_j, s_j)}_(j∈M), the defender can choose a model architecture (action r_j) and backdoor defense (action s_j). The predicted label can be evaluated against the target poison label or the clean label. The 3 main ASR metrics are: (1) run-time accuracy of the predicted labels with respect to (w.r.t.) poisoned labels given backdoored inputs; (2) run-time accuracy of the predicted labels w.r.t. clean labels given backdoored inputs; (3) run-time accuracy of the predicted labels w.r.t. clean labels given clean inputs. The defender's primary objective is to minimize the individual and collective attack success rate of a set of attackers (minimize (1)), and its secondary objective is to retain accuracy against clean inputs (maximize (3)). In this setup, we focus on minimizing the collective attack success rate, hence the defender's payoff can be approximated as the complement of the mean attacker payoff (Eqt 2). We denote the collective attacker payoff and defender payoff, the utility functions of the game, as π_a = mean ± std and π_d = (1 − mean) ± std respectively.

π_(a_i) = Acc( f( X_i; (θ, D); (r_j, s_j); {(ε_i, p_i, Y_i^poison, b_i), (ε_¬i, p_¬i, Y_¬i^poison, b_¬i)} ), Y_i^poison )   (1)

π_d = 1 − (1/N) Σ_i^N Acc( f(·), Y_i^poison )   (2)
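The payoff bookkeeping in Eqt 1 and Eqt 2 can be sketched directly; the snippet below (a minimal illustration with hypothetical per-attacker ASR values, not results from the paper) computes the collective attacker payoff π_a = mean ± std and the defender payoff π_d:

```python
# Sketch of the game payoffs (Eqt 1-2), assuming per-attacker attack success
# rates have already been measured against each attacker's poisoned labels.
from statistics import mean, stdev

def attacker_payoffs(per_attacker_asr):
    """pi_a: collective attacker payoff reported as (mean, std) of individual ASRs."""
    mu = mean(per_attacker_asr)
    sd = stdev(per_attacker_asr) if len(per_attacker_asr) > 1 else 0.0
    return mu, sd

def defender_payoff(per_attacker_asr):
    """pi_d = 1 - (1/N) * sum_i Acc(f(.), Y_i^poison)  (Eqt 2)."""
    return 1.0 - mean(per_attacker_asr)

asr = [0.12, 0.08, 0.11, 0.09, 0.10]   # hypothetical ASRs for N = 5 attackers
mu, sd = attacker_payoffs(asr)
print(f"pi_a = {mu:.3f} +/- {sd:.3f}, pi_d = {defender_payoff(asr):.3f}")
```

Note that π_d is the complement of the mean individual ASR, so the two payoffs always sum to one, matching the zero-sum reading in Appendix A.1.1.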

3.2. INSPECTING SUBNETWORK GRADIENTS

A distribution shift is a divergence between 2 distributions of features with respect to their labels. Distribution shifts vary by source of distribution (e.g. domain, task, label shift) and variations per source (e.g. multiple backdoor triggers, multiple domains). Joint distribution shift is a distribution shift attributed to multiple sources and/or variations per source. Eqt 8 is an example of how the multi-agent backdoor attack (multiple variations of backdoor attack) alters the probability density functions per label. Suppose θ_(t−1) has been optimized with respect to the clean samples D \ {D_i} at iteration t − 1, and in the next iteration t we sample a (subnetwork) gradient ϕ ∼ Φ to minimize the loss on distributionally-shifted samples D. At least one optimal ϕ_i = θ_t − θ_(t−1) exists that maps distributionally-shifted data to ground-truth labels ϕ : ε_i → y_i. We can inspect the insertion of subnetwork gradients. In our analysis, the gradient ϕ is a subnetwork gradient corresponding to a specific shift: θ_t = θ_(t−1) + Σ_i^|{ϕ}| ϕ_i. Supporting explanation and proofs for Theorems 1 and 2 are provided in Appendix A.1.4 and A.1.6.

Theorem 1. Let x, y ∼ D \ (D_0 ∪ {D_i}^N) be sampled clean observations and (x + ε_noise + {ε_i}^N), (y → y_i) ∼ D_i be sampled backdoored observations. Then the prediction ŷ = f(x + ε; θ) ∼ U(𝒴) s.t. P(y*) = 1/|𝒴|.

Theorem 2. A model of fixed capacity permits θ with limited subnetworks. The loss optimization condition (Eqt 16) constrains the insertion of subnetwork gradients ϕ to minimize total loss over the joint dataset. To satisfy the ϕ-insertion condition LHS < RHS (Eqt 16), other than imbalancing the loss terms with a high poison rate (Lemma 3), Eqt 17 shows how the transferability of ε determines whether its subnetwork gradient ϕ is accepted given ε → ϕ. It is empirically demonstrated that |{ε : ϕ}*| ≪ N.
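The update θ_t = θ_(t−1) + Σ ϕ_i underlying this section can be illustrated with a toy sketch, where each subnetwork gradient only modifies the coordinates selected by its mask m_i; all weights, gradients, and masks below are invented for illustration:

```python
# Toy illustration of subnetwork-gradient insertion (Section 3.2):
# theta_t = theta_{t-1} + sum_i phi_i, where each phi_i = g_i * m_i only
# touches the coordinates selected by its mask m_i. Values are hypothetical.

def insert_subnetworks(theta, subnetwork_grads):
    """Apply each masked gradient phi_i to a flat parameter vector theta."""
    theta = list(theta)
    for grad, mask in subnetwork_grads:
        for j, (g, m) in enumerate(zip(grad, mask)):
            theta[j] += g * m          # only masked coordinates move
    return theta

theta0 = [0.5, -0.2, 0.1, 0.0]
phis = [
    ([0.3, 0.3, 0.3, 0.3], [1, 1, 0, 0]),      # attacker 1's subnetwork mask
    ([-0.1, -0.1, -0.1, -0.1], [0, 0, 1, 1]),  # attacker 2's subnetwork mask
]
theta1 = insert_subnetworks(theta0, phis)
print(theta1)   # each phi_i edits only its own masked coordinates
```

Because the masks are disjoint here, the two insertions do not interfere; the backfiring analysis concerns the case where capacity is limited and such disjoint insertions cannot all be accepted.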

4.1. DESIGN

Methodology. We implement the baseline backdoor attack algorithm BadNet (Gu et al., 2019b) with the adaptation of randomized pixels as unique backdoor trigger patterns per attacker (Appendix A.1.8). We evaluate on the CIFAR10 dataset with 10 labels (Krizhevsky, 2009). The real poison rate ρ of an attacker a_i is the proportion of the joint dataset that is backdoored, ρ = |X_i^poison| / |D|. For N attackers and V_d being the proportion of the dataset allocated to the defender, the real poison rate is calculated as ρ = (1 − V_d) × (1/N) × p. SVHN (Netzer et al., 2011) is used in a domain pair for digits; CIFAR10 (Krizhevsky, 2009) and STL10 (Coates et al., 2011) are a domain pair for objects. Figure values are reported out of 1.0. Capacity (Figure 2): we trained SmallCNN (channels [16, 32, 32]), ResNet-{9, 18, 34, 50, 101, 152} (He et al., 2015), Wide ResNet-{50, 101}-2 (Zagoruyko & Komodakis, 2016), and VGG-11 (Simonyan & Zisserman, 2015).
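As a minimal sketch of this setup (trigger size, corner placement, and seeds are our own assumptions, not the paper's exact configuration), the following applies a BadNet-style random-pixel trigger and computes the real poison rate ρ:

```python
# Sketch of the per-attacker setup in Section 4.1: a random-pixel trigger
# stamped into a corner (BadNet-style), plus rho = (1 - V_d) * (1/N) * p.
import random

def make_trigger(size=3, channels=3, seed=0):
    """A unique random-pixel pattern per attacker, keyed by seed."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(channels)]
             for _ in range(size)] for _ in range(size)]

def apply_trigger(img, trigger):
    """Stamp the trigger into the top-left corner of an l x w x c image."""
    out = [[px[:] for px in row] for row in img]   # copy, leave img intact
    for i, trow in enumerate(trigger):
        for j, tpx in enumerate(trow):
            out[i][j] = tpx[:]
    return out

def real_poison_rate(N, V_d, p):
    """rho: each attacker poisons fraction p of its 1/N share of (1 - V_d)."""
    return (1 - V_d) * (1 / N) * p

img = [[[0.0, 0.0, 0.0] for _ in range(4)] for _ in range(4)]
poisoned = apply_trigger(img, make_trigger(size=2, seed=7))
print(round(real_poison_rate(N=5, V_d=0.1, p=0.55), 6))   # ≈ 0.099
```

So with N = 5 attackers, V_d = 0.1 and p = 0.55 (the base case of (E4)), each attacker backdoors roughly 9.9% of the joint dataset.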

(E3) Additional shift sources

The multi-agent backdoor attack thus far manifests joint distribution shift in terms of increasing variations per source; how would it manifest if we increase sources? Adversarial perturbations ε_a, introduced during test-time, are generated with the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015). Stylistic perturbations α → ε_style (α = 1.0 means 100% stylization), introduced during train-time, are generated with Adaptive Instance Normalization (AdaIN) (Huang & Belongie, 2017). Results are summarized in Figure 7.

(E4) Cooperation of agents

In this section, we wish to leverage agent dynamics in the backdoor attack by investigating: can cooperation between agents successfully maximize the collective attack success rate? The base case is N = 5, V_d = 0.1, p, ε = 0.55; the last parameter applies to the N = 100 case; all 3 parameters apply to the Defense (Backdoor Adversarial Training w.r.t. E5) configurations case. We evaluate (non-)cooperation w.r.t. information sharing of input poison parameters and/or target poison label selection. We summarize the results for coordinated trigger generation in Table 4, and the lack thereof in Table 3. We record the escalation of poison rate and trigger label selection in Figure 6.

(E5) Performance against Defenses

In this section, we investigate: how do single-agent backdoor defenses affect the multi-agent backdoor attack payoffs? Defenses are evaluated on the Clean Label Backdoor Attack (Turner et al., 2019) in addition to BadNet. We evaluate 2 augmentative (data augmentation (Borgnia et al., 2021), backdoor adversarial training (Geiping et al., 2021)) and 2 removal (spectral signatures (Tran et al., 2018), activation clustering (Chen et al., 2018)) defenses. Results are summarized in Figure 3.

(E6) Model parameters inspection

In this section, we investigate how model parameters change as N increases.
To measure the likelihood that a set of trained models on different attack configurations contain similar subnetworks, we measure the distance in parameters, specifically the distance in parameters per layer for the original full DNN and pruned DNN. We prune SmallCNNs and generate the lottery ticket (subnetwork) with Iterative Magnitude Pruning (IMP) (Frankle & Carbin, 2019b) . Results are in Figure 8 .
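A simplified sketch of this measurement, substituting one-shot magnitude pruning for full IMP and using invented layer weights, could look like:

```python
# Sketch of the (E6) measurement: derive a magnitude-pruning mask (a
# one-shot stand-in for Iterative Magnitude Pruning) and compare the
# parameters of two trained models layer by layer. Weights are hypothetical.

def magnitude_mask(weights, sparsity=0.5):
    """Keep the largest-magnitude fraction (1 - sparsity) of a flat layer."""
    k = int(len(weights) * (1 - sparsity))
    kept = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))[:k]
    mask = [0] * len(weights)
    for i in kept:
        mask[i] = 1
    return mask

def layer_distance(w_a, w_b, mask=None):
    """Squared Euclidean distance, optionally restricted to a ticket mask."""
    if mask is None:
        mask = [1] * len(w_a)
    return sum(m * (a - b) ** 2 for a, b, m in zip(w_a, w_b, mask))

w_n1 = [0.9, -0.1, 0.4, 0.05]    # one layer of an N = 1 model
w_n5 = [0.2, -0.1, -0.3, 0.05]   # the same layer of an N = 5 model
mask = magnitude_mask(w_n1, sparsity=0.5)
print(mask, layer_distance(w_n1, w_n5, mask))
```

Repeating this per layer, for the full parameters, the ticket mask, and the ticket parameters, yields the three distance series compared in Figure 8.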

4.2. FINDINGS

The main takeaway from our findings is the phenomenon, denoted as the backfiring effect, where a backdoor trigger pattern will trigger random label prediction and attain a lower-bounded collective attack success rate of 1/|𝒴|. The backfiring effect demonstrates the following properties:

1. (Observation 1) Backdoor trigger patterns tend to return random label predictions, and thus the collective attack success rate converges to the lower bound (Theorem 1). Optimal subnetworks per attacker are likely not inserted (Theorem 2).
2. (Observation 2) Observation 1 is resilient against most combinations of agent strategies, particularly variations in defense, and cooperative/anti-cooperative behavior.
3. (Observation 3) Adversarial perturbations are persistent and can co-exist in backdoored inputs while successfully lowering accuracy w.r.t. clean labels.
4. (Observation 4) Model parameters at N > 1 become distant compared to N = 1, but for varying N > 1 tend to be similar to each other.

(Observation 1: Backdoor-induced randomness) Across (E1-6), as N increases, the collective attack success rate decreases. In the presence of a backdoor trigger pattern, the accuracy w.r.t. poisoned labels tends towards the random lower bound. (E3: {backdoor, stylized}) For run-time poison rate 1.0 at N = 1, stylized perturbations do not affect accuracy w.r.t. poisoned labels. At N = 100, stylized perturbations yield a further decrease in accuracy w.r.t. poisoned labels. We would expect stylization to strengthen a backdoor trigger pattern, in line with literature where backdoor triggers are piece-wise (Xue et al., 2020). However, Theorem 1 argues the backfiring effect persists despite stylization, as the distribution of (ε_style ∈ ε) → y_i would still tend to be random. It suggests the unlikelihood of trigger strengthening (or joint saliency), even if only poisoned inputs are stylized. Hence, attackers should conform their data source to that of other agents. Defenders should also robustify the joint dataset against shift-inconsistencies; e.g.
we expect augmentative defenses to contribute to the backfiring effect and lower the accuracy w.r.t. poisoned labels. (E5) Some single-agent defenses counter the backfiring effect and increase the collective attack success rate for BadNet and Clean-Label attacks. (Observation 2: Futility of optimizing against other agents) (E4: {poison rate}) Escalation is an intriguing aspect of this attack, as the payoffs have as much to do with the order in which attackers coordinate as they do with individual attack configurations. In Figure 6 (right), the escalation of poison rate affects the distribution of individual attack success rates, but not the collective attack success rate. The interquartile range narrows when 80% of the attackers escalate (unequal escalation), but returns to equilibrium once 100% of the attackers escalate to 0.55 (equal escalation). Non-uniform private datasets (e.g. heterogeneous label sets, stylization/domain shift, escalating ε) act against individual and collective ASR; attackers should prefer to coordinate such that their private dataset contributions approximate a single-agent attack. (E4: {target poison label}) In Table 3, if all attackers coordinate on the same target poison label, a multi-agent backdoor attack can be successful. This is unlikely attributable solely to feature collisions ||x_i − x_¬i||²₂ ≈ 0, as this pattern persists agnostic to the cosine distance between backdoor trigger patterns. From an undefended multi-agent backdoor attack perspective, this would be considered a successful attack. Though the most successful attacker strategy, it is not robust to defender strategies: the worst-performing backdoor defense reduces the payoff substantially such that attackers attain a better expected payoff by not coordinating label overlap (Table 6). Given the dominant strategy of the defender is to enforce a backdoor defense, the Nash equilibrium (20.3, 79.7)% is attained when attackers opt for random trigger patterns.
Assuming attackers can coordinate a joint strategy of random trigger patterns and 100% trigger overlap, they can attain an optimal payoff of (27.4, 72.6)%. 100% label overlap works optimally with trigger patterns of low cosine distance. Orthogonally-coordinated trigger patterns return consistently-low collective attack success rates (Table 4). (E4: {backdoor trigger pattern}) In terms of sub-group cooperation, when 40% of attackers coordinate on the same target label, there is no unilateral increase in their individual ASR compared to the other attackers at N = 5. For a large number of attackers (N = 100), in Table 3 and Figure 6 (left), when the sub-group of attackers coordinating their target labels increases, the collective ASR tends to increase and the distribution of individual ASR narrows. With respect to Theorem 2, it is empirically implicit that few backdoor subnetworks are inserted. The general pattern is that when attackers exercise non-cooperative aggression non-uniformly, the distribution of their ASR widens, but when aggression is uniform, the distribution narrows down to the lower bound of ASR (mutually-assured destruction). (E4: {target poison label}) We evaluate attackers cooperatively generating trigger patterns that reduce feature collisions and minimize loss interference (Eqt 17), i.e. orthogonal and residing in distant regions of the input space. The collective ASR is low, even with 100% target label overlap. (E4: {backdoor trigger pattern, target poison label}) Coordinating low- or high-distance trigger patterns is futile. Attackers coordinating such that they share 1 identical backdoor trigger pattern and 1 identical target poison label will approximate a single-agent attack. Other than the downside of not being able to flexibly curate the attack to their needs (e.g. targeted misclassification), single-agent backdoor attacks are demonstrably mitigable.
In Table 3, where we have a set of low-distance trigger patterns (inadvertently, due to a high ε), if attackers picked identical target poison labels despite non-identical backdoor trigger patterns, the collective ASR is high. This is in line with results from Xue et al. (2020), where the authors implemented 2 single-agent backdoor attacks with multiple trigger patterns of expectedly low distance from each other (one attack where the trigger patterns are of varying intensity of one pattern; another attack where they compose different sub-patterns, and thus different combinations of these sub-patterns compose different triggers of low distance to each other), and demonstrated a high attack success rate. Similarly, our attackers share a trigger pattern sub-region (the overlapping region between trigger patterns) that is salient during training (i.e. an agent-robust backdoor trigger sub-pattern). This cooperative setting could be interpreted as particularly weak, given the ease of defending against it, and the requirement of attackers sharing information that can be used against them (e.g. anti-cooperative behaviour). (Observation 3: Resilient adversarial perturbations) (E3: {adversarial, stylized}) For run-time poison rate 0.0 (backdoored at train-time, not run-time), adversarial perturbations with respect to a private dataset, despite varying texture shift between private datasets, can attain a high adversarial attack success rate (low accuracy w.r.t. clean labels) in a multi-agent backdoor attack. An attacker can still pursue an adversarial attack strategy despite multiple agents; this may not always be practical if the attacker requires a misclassification of a specific target label (as demonstrated in this experiment). (E3: {backdoor, adversarial, stylized}) Low ε_b, p and (E3: {backdoor, adversarial}) increasing ε_a yields increasing backdoor ASR (accuracy w.r.t. poisoned labels, run-time poison rate 1.0).
High ε_b, p and increasing ε_a yields decreasing backdoor ASR. Interference takes place between adversarial and backdoor perturbations: when p is low against the surrogate model's gradients, FGSM is optimized towards pushing the inputs towards the poisoned label, but when p is high then FGSM is optimized towards pushing inputs away from the poisoned label. (Observation 4: Increasingly-distant model parameters) (E6) The weights for N = 1 are far from the weights for N > 1. The weights for N > 1 are all close to each other. The distance between weights tends to increase down convolutional layers and decrease down fully-connected layers. The distance values are similar between the full network parameters, the mask of the lottery ticket, and the lottery ticket parameters. This implies the new optimum of the full network is specifically attributed to changes in the lottery ticket required to resolve the backdoor trigger patterns. Since the weights do not change significantly w.r.t. N for N > 1, particularly for the lottery ticket, it also implies there is no proportional number of subnetworks inserted, supporting that few backdoor subnetworks are inserted (Thm 2).
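The lower bound claimed in Observation 1 can be sanity-checked with a toy Monte Carlo: if the backdoored model's predictions on triggered inputs are uniform over |𝒴| labels, each attacker's ASR concentrates near 1/|𝒴|. The simulation below is purely illustrative, with invented sample sizes:

```python
# Toy check of the backfiring lower bound (Observation 1 / Theorem 1): a
# model that predicts uniformly at random over |Y| labels gives each
# attacker an ASR near 1/|Y|, regardless of their chosen poison label.
import random

def simulated_collective_asr(n_labels=10, n_attackers=5, n_inputs=20000, seed=0):
    rng = random.Random(seed)
    asrs = []
    for _ in range(n_attackers):
        target = rng.randrange(n_labels)          # attacker's poison label
        hits = sum(rng.randrange(n_labels) == target for _ in range(n_inputs))
        asrs.append(hits / n_inputs)              # that attacker's ASR
    return sum(asrs) / n_attackers

print(simulated_collective_asr())   # ~ 0.1 for CIFAR10's 10 labels
```

This matches the 1/|𝒴| = 0.1 floor that the experiments converge to as N grows.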

5. RECOMMENDATIONS & CONCLUSION

Motivated by the pursuit of practical robustness against backdoor attacks and machine learning at large, we investigate the multi-agent backdoor attack, and extend the actions of attackers, such as the use of adversarial attacks at test-time, or a choice of cooperation or anti-cooperation. Aside from our findings, the main takeaways are as follows:

1. The backfiring effect acts as a natural defense against multi-agent backdoor attacks. Existing models may not require significant defenses to block multi-agent backdoor attacks. If it is likely that multiple attackers exist, then the defender could focus on aspects of model robustness other than backdoor robustness. This motivates backdoor defenses in practical settings, as most backdoor defenses are directed at single-attacker setups.
2. We caution that the effectiveness of existing (single-agent) backdoor defenses drops when the number of attackers increases, thus they may not be prepared to robustify models against multi-agent backdoor attacks. We recommend further study into multi-agent backdoor defenses.

Henceforth, we recommend using the multi-agent setting as a baseline for practical backdoor attack/defense work. In addition to evaluating a prospective defense against a backdoor attack with no defenses, we may wish to evaluate it against a "natural setting" baseline (no defenses, purely multi-agent attacks, e.g. N = 100). We also recommend the evaluation of a prospective attack in a multi-agent setting (how robust is the attack success rate when multiple attackers are present). Shifting away from the focus of new attack designs optimized against defenses, we may also consider optimizing attack designs against this backfiring effect.

A APPENDIX

A.1 METHODOLOGY (EXTENDED)

A.1.1 GAME DESIGN (EXTENDED)

In this multi-agent training regime, there are two types of agents: defenders and participants. Participants can be classified as either attackers or non-attackers. To simplify the discussion and analysis, we evaluate the setup in terms of attackers and defenders (experimentally, a non-attacking participant would approximate a defender with a larger dataset allocation). A multi-agent and single-agent attack are backdoor attacks with multiple and single attackers respectively.

Equilibrium payoffs. In setups where attackers are only playing against attackers, the equilibrium π_(a_i, a_¬i) is the collective payoff π_a of the highest value in the payoff matrix: π_(a_i, a_¬i) = max(mean ± std). For setups where attackers are playing against defenders, the equilibrium π_(a,d) is the collective payoff (π_a, π_d) where both payoff values are maximized with respect to the dominant strategy taken by the other. We demonstrate this procedure in Eqt 3, where we map strategy indices q, u for each agent by Q, U respectively: Q : q → (ε, p, Y^poison, b), U : u → (r, s).

π_(a,d) = ( Acc(Q(q̂)|U(û)), Acc(U(û)|Q(q̂)) )

(q̂, û) := arg max_(q,u) Acc(Q(q)|U(u)) ∩ arg max_(q,u) Acc(U(u)|Q(q))
        := arg max_(q,u) Acc(Q(q)|U(u)) ∩ arg min_(q,u) Acc(Q(q)|U(u)), where Acc(U(u)|Q(q)) = 1 − Acc(Q(q)|U(u))
        := {(q, u)_w}_(w∈W) ∩ {(q, u)_v}_(v∈V)
        := {(q, u)_(w=v)}_(w,v∈W,V)   (3)

From this result for π_(a,d), we find that the (q̂, û)-optimization procedure is one where the objective is to jointly maximize and minimize Acc w.r.t. (q, u), and payoffs at (q, u)_(w=v) are the Nash equilibria. It is additionally indicated that the backdoor attack, as well as the multi-agent backdoor attack, is a zero-sum game, given that if the total gains of agents are added up and the total losses are subtracted, they will sum to zero.
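The equilibrium-selection procedure of Eqt 3 amounts to finding saddle points of the Acc matrix; a minimal sketch (with an invented 2×2 payoff matrix, attacker payoff Acc and defender payoff 1 − Acc) is:

```python
# Sketch of the equilibrium procedure in A.1.1, assuming a small matrix of
# collective ASR values Acc(Q(q)|U(u)); the game is zero-sum, so the
# defender payoff is 1 - Acc. All matrix values below are invented.

def pure_saddle_points(acc):
    """(q, u) cells that maximize Acc over q and minimize Acc over u."""
    points = []
    for q, row in enumerate(acc):
        for u, v in enumerate(row):
            best_q = all(v >= acc[q2][u] for q2 in range(len(acc)))  # attacker best response
            best_u = all(v <= row[u2] for u2 in range(len(row)))     # defender best response
            if best_q and best_u:
                points.append(((q, u), (v, 1 - v)))
    return points

acc = [            # rows: attacker strategies q; cols: defender strategies u
    [0.62, 0.21],
    [0.55, 0.27],
]
print(pure_saddle_points(acc))
```

A pure-strategy Nash equilibrium exists whenever such a saddle point exists; the (ASR, 1 − ASR) pairs reported in Section 4.2 are payoffs of this form.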
A.1.2 PRELIMINARIES ON SUBNETWORK GRADIENTS

θ_t := θ_(t−1) − Σ_(x,y)^(X,Y) ∂L(x, y)/∂θ  ⇒  ϕ_(X,Y) = − Σ_(x,y)^(X,Y) ∂L(x, y)/∂θ   (4)

Suppose the optimization of the parameters θ is viewed as a discrete optimization process, where each iteration samples a gradient from a set of gradients ϕ ∼ Φ (Eqt 4), such that the total loss L decreases. In this analysis, we segregate the θ-update with respect to clean data and distributionally-shifted data. Suppose θ_(t−1) has been optimized with respect to the clean samples D \ {D_i} at iteration t − 1, and in the next iteration t we sample ϕ ∼ Φ to minimize the loss on distributionally-shifted samples D. An example is the change in the probability density functions per class between before (Eqt 5) and after (Eqt 6) the train-time distribution is backdoor-perturbed. At least one optimal ϕ_i = θ_t − θ_(t−1) exists that can map distributionally-shifted data to ground-truth labels ϕ : ε_i → y_i. Hence, Φ is a set that contains a set of endpoint gradients {ϕ_i}^N as well as a set of interpolated gradients ϕ_i ϕ_¬i. Frankle & Carbin (2019a) showed in their work on the lottery ticket hypothesis that a DNN can be decomposed into a pruned subnetwork that carries the same functional similarity and accuracy as the full DNN. An (optimal) subnetwork θ ⊙ m is the collection of the minimum number of nodes required for the prediction of a ground-truth class with respect to the set of features, where the mask m ∈ {0, 1}^|θ| determines the indices in θ not zeroed out. Subsequent works, such as MIMO (Havasi et al., 2021), show that multiple subnetworks can exist in a DNN, each subnetwork approximating a sub-function that predicts the likelihood a feature pertains to a specific class. Moreover, Qi et al. (2021b) show that a backdoor trigger can be formulated as a subnetwork and only occupies a small portion of a DNN; in their work each subnetwork occupied 0.05% of model capacity.
The subsequent iteration is thus evaluating the selection of subnetworks to insert into θ, where each subnetwork corresponds to a specific shifted function. Hence, the gradient ϕ is a combination of the various functional subnetwork gradients that can be inserted while satisfying condition 16. Interpolated gradients ϕ_i ϕ_¬i = (θ_t − θ_(t−1)) ⊙ (Σ_i^N m_i) are gradients with different combinations of subnetwork masks and subnetwork values assigned in m_i and θ_t accordingly; endpoint gradients are the special case ϕ_i = (θ_t − θ_(t−1)) ⊙ m_i. For our analysis of the multi-agent backdoor attack with respect to joint distribution shift, the gradient ϕ is a subnetwork gradient corresponding to a specific shift ε → ϕ (e.g. backdoor trigger pattern, or sub-population shift in clean inputs, or stylization): θ_t = θ_(t−1) + Σ_i^|{ϕ}| ϕ_i.

A.1.3 PRELIMINARIES ON JOINT DISTRIBUTION SHIFT

Distribution shifts can vary by source of distribution (e.g. domain shift, task shift, label shift) and variations per source (e.g. multiple backdoor triggers, multiple domains). Joint distribution shift is denoted as the phenomenon when distribution shift is attributed to multiple sources and/or variations per source. Eqt 8 is an example of how the multi-agent backdoor attack (multiple variations of backdoor attack) alters the probability density functions per label. To address joint distribution shift, ϕ should be transferable across a set of {ε}. One approach is to inspect the insertion of subnetworks. There is growing literature on the study of joint distribution shift.

Lemma 1. For an input variable X(ω) that is sampled randomly, the output variable X̃(ω) from operations ε applied to X(ω) will also tend to be random.

Proof. A random variable X is a mapping from W to ℝ, that is X(ω) ∈ ℝ for ω ∈ W. X̃(ω) = X(ω) + ε, thus X̃ is also a mapping X̃ : W → ℝ. The measure for random variable X is defined by the cumulative distribution function F(x) = P(X ≤ x).
For x > 0, F_X̃(x) = P(X̃ ≤ x) = P(X + ε ≤ x) = P(X ≤ x − ε) = F_X(x − ε). Thus X̃(ω) is also measurable and is a random variable defined on the sample space W.

Lemma 2. Suppose a given model f(x, y; θ) = θ · x and loss L(x, y; θ) = f(x) − y. Suppose we sample backdoored observations (x_i = x + ε_i), (y → y_i) ∼ D_i. The change in loss between the clean and perturbed input is ΔL = ε · θ + c.

Proof.

ΔL = L(x_i, y_i; θ) − L(x, y; θ)
   = [f(x_i, y_i; θ) − f(x, y; θ)] − [y_i − y]
   = θ[x_i − x] − [y_i − y]
∂ΔL/∂θ = x_i − x = ε
⇒ ΔL = ε · θ + c

Theorem 1. Let x, y ∼ D \ (D_0 ∪ {D_i}^N) be sampled clean observations and (x + ε_noise + {ε_i}^N), (y → y_i) ∼ D_i be sampled backdoored observations.

x̃ = x + ε
L(x̃, y) = L(x, y) + L(ε, y)
∂L(x̃, y)/∂θ = ∂L(x, y)/∂θ + ∂L(ε, y)/∂θ
⇒ θ_t := θ_(t−1) − Σ_(x,y)^(X,Y) ∂L(x, y)/∂θ − Σ_(ε,y)^(X,Y) ∂L(ε, y)/∂θ

This decomposition implies ∂L(x, y)/∂θ updates the part of θ w.r.t. x, which we denote as θ ⊙ m_x, and ∂L(ε, y)/∂θ updates the part of θ w.r.t. ε, which we denote as θ ⊙ m_ε, where m_x, m_ε ∈ {0, 1}^|θ| are masks of θ ≡ θ ⊙ (m_x + m_ε). Given the distances (squared Euclidean norm) between the shifted inputs and outputs x → x_i and y → y_i, we can enumerate the following 4 cases. Case (1) is approximately a single-agent backdoor attack, and is not evaluated. Cases (2)-(4) are variations of shifts in inputs and labels in a backdoor attack and manifest in our experiments.

||x_i − x||²₂ ≈ 0 , ||y_i − y||²₂ ≈ 0 (Case 1)
||x_i − x||²₂ > 0 , ||y_i − y||²₂ ≈ 0 (Case 2)
||x_i − x||²₂ ≈ 0 , ||y_i − y||²₂ > 0 (Case 3)
||x_i − x||²₂ > 0 , ||y_i − y||²₂ > 0 (Case 4)

For ε = {ε_i}_(i∈N+1), if N → ∞, then ε ∼ Rand. We denote a random distribution Rand : s ∼ U(S) s.t. P(s) = 1/|S|, where an observation s is uniformly sampled from a (discrete) set S. By Lemmas 1 and 2, if ε ∼ Rand, then ∂L(ε, y; θ)/∂θ ∼ Rand and f(x_i; θ) − f(x; θ) ≈ f(ε; θ) ∼ Rand.
Hence, for each case of ∂L(ε, y; θ)/∂θ: if ∂L(ε, y; θ)/∂θ ≠ 0, given θ = θ ⊙ (m_x + m_ε), then f(ε; θ) ≈ f(ε; θ ⊙ m_ε) ∼ Rand; if ∂L(ε, y; θ)/∂θ = 0, given m_x = 1^{|θ|}, m_ε = 0^{|θ|}, then f(ε; θ) ≈ f(ε; θ ⊙ m_x) ∼ Rand. In both cases, the predicted value of f will be sampled randomly. Given that it randomly samples from the label space Y, in a multi-agent backdoor attack with shifted input:output Cases (2)-(4), it follows that under the presence of a backdoor trigger pattern a prediction y ∼ U(Y) s.t. P(y) = 1/|Y|. The lower bound of the attack success rate would be 1/|Y| (0.1 for CIFAR-10).

A.1.5 INSPECTING SUBNETWORK GRADIENTS: CHANGES IN PROBABILITY DISTRIBUTIONS W.R.T. X, Y-SPACE

Theorem 3. Let x, y ∼ D \ (D_0 ∪ {D_i}_N) and (x + ε_noise + {ε_i}_N), (y → y_i) ∼ D_i be sampled clean and backdoored observations from their respective distributions. P_{x→y}(x) denotes the probability density function computing the likelihood that features of x map to label y. A model f can be approximated by P of all labels (Eqt 8). For any given pair of attacker indices (i, ¬i) and their corresponding backdoor trigger patterns (ε_i, ε_¬i) and target poison labels (y_i, y_¬i), we formulate the updated model f that can be approximated by P of all labels as Eqt 8. By analysis of cases and empirical results, the final prediction f(x) is skewed w.r.t. the distribution of {ε}.

Proof sketch of Theorem 3. Inductively demonstrated with different attack scenarios, we show that the model as a function approximator is composed of multiple probability density functions corresponding to each backdoor mapping ε_i : y_i.

No Attack (N=0). We sample a set of clean observations x, y ∼ D. P_{x→y}(x) denotes the probability density function computing the likelihood that features of x map to label y. A model f can be approximated by P of all labels P(x) = {P_{x→y}(x) · P_{ε_noise→y}(ε_noise)}_{y∈Y}, i.e.:

f(x; θ) = arg max_{y∈Y} {P_{x→y}(x) · P_{ε_noise→y}(ε_noise)}   (5)

Single-Agent Backdoor Attack (N=1).
We sample clean observations x, y ∼ D \ D_0 and backdoored observations (x + ε_noise + ε_0), (y → y_0) ∼ D_0, where ε_0 > 0 and y ≠ y_0. Sampling an input from the joint distribution x ∼ D where D = D_0 ∪ (D \ D_0), x would be evaluated by f with respect to all features (including the perturbation feature). The newly-added perturbation feature ε_0 is evaluated by f, whether it manifests in a given input or not (returning 0 if not), and requires a corresponding subnetwork gradient ϕ_0. The proposed subnetwork gradient insertion ϕ_0 is accepted if Eqt 15 is satisfied.

f(x; θ + ϕ_0) = arg max_{y∈Y} {P_{x→y}(x) · P_{ε_noise→y}(ε_noise) · P_{ε_0→y}(ε_0)}   (6)

Multi-Agent Backdoor Attack (N=2). We sample clean observations x, y ∼ D \ (D_0 ∪ D_1), and backdoored observations (x + ε_noise + ε_0), (y → y_0) ∼ D_0 and (x + ε_noise + ε_1), (y → y_1) ∼ D_1, where ε_0, ε_1 > 0 and y ≠ y_0, y_1. There are 2 primary considerations to evaluate: (I) transfer/interference between features and labels between D_0 and D_1; and (II) loss reduction w.r.t. gradient selection. (I) manifests case-by-case, depending on whether in a particular case ||ε_0 − ε_1||_2^2 > 0 or ||ε_0 − ε_1||_2^2 ≈ 0, and whether y_0 = y_1 or y_0 ≠ y_1. In terms of gradient selection (II), there are at least 4 subnetwork gradient scenarios to evaluate: (i) no subnetwork gradient [θ], (ii) the subnetwork gradient of ε_0 (endpoint) [θ + ϕ_0], (iii) the subnetwork gradient of ε_1 (endpoint) [θ + ϕ_1], and (iv) an interpolated subnetwork gradient between ε_0 and ε_1 [θ + ϕ_0 ϕ_1]. Sampling x ∼ D, each of these θ + ϕ is evaluated case-by-case in Eqt 7. Among these candidate subnetwork gradients, the inserted (combination of) subnetwork gradients is determined by Eqt 17.

f(x; θ + ϕ) = arg max_{y∈Y} {P_{x→y}(x) · P_{ε_noise→y}(ε_noise) · P_{ε_0→y}(ε_0) · P_{ε_1→y}(ε_1)}   (7)

Multi-Agent Backdoor Attack (N>1).
Extending on our study of the 2-attacker scenario, for any given pair of attacker indices (i, ¬i), we need to consider the distances (squared Euclidean norm) of (ε_i, ε_¬i) and (y_i, y_¬i). By induction, we obtain Eqt 8, where ϕ is an interpolation of N subnetworks to varying extents.

f(x; θ + ϕ) = arg max_{y∈Y} {P_{x→y}(x) · P_{ε_noise→y}(ε_noise) · Π_{i}^{N} P_{ε_i→y}(ε_i)}   (8)

We enumerate cases from Eqt 8, mapped similarly to the Theorem 1 cases. Note these are non-identical case mappings: the Theorem 1 cases evaluate distances between the unshifted and shifted inputs and labels in the joint dataset; the Theorem 3 cases evaluate distances between the inputs and labels of the private datasets of different attackers.

(Case 1) If ||ε_i − ε_¬i||_2^2 ≈ 0 and y_i = y_¬i, the attackers approximate a single attacker {ε_0, y_0}, hence the collective attack success rate should approximate that of a single-agent backdoor attack.

(Case 3) If y_i ≠ y_¬i and ||ε_i − ε_¬i||_2^2 ≈ 0, then the feature collisions arising due to this label shift will cause conflicting label predictions from each P_{ε_i→y}(ε_i) in Eqt 8, which will skew the final label prediction. This manifests in escalation, where in E4 we observe that if |{ε_i → y_i}| > |{ε_¬i → y_¬i}|, then the attack success rate of a_i would be better than that of a_¬i. This also manifests when there is a large number of attackers |ε|, where in E4 we observe that many attackers with low-distance perturbations but randomly-assigned target trigger labels tend to result in a low collective attack success rate. This phenomenon may arise due to the model returning random label predictions at test-time if provided random labels at train-time, in-line with Theorem 1, and extending upon Zhang et al. (2017).
(Cases 2 & 4) If ||ε_i − ε_¬i||_2^2 > 0, whether y_i = y_¬i or y_i ≠ y_¬i, given that the backdoor trigger patterns are distant in the feature space (minimal feature collision), it follows that the collective attack success rate should depend more on the model capacity to store a unique subnetwork for each ε. Empirically, this is in-line neither with the capacity findings in E2 nor with the trigger distance findings in E4. This informs us that, although the cosine distance indicates a great distance between trigger patterns, feature collisions still occur in practice when ||ε_i − ε_¬i||_2^2 > 0. It indicates that Case 3 (skewed label prediction) is more dominant in practice, and this is in-line with E4, where the cosine distance between trigger patterns is high, but y_i = y_¬i returns a higher collective attack success rate than y_i ≠ y_¬i.
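The randomized-prediction lower bound from Theorem 1 (attack success rate tending to 1/|Y|) can be sanity-checked with a quick Monte Carlo sketch. This simulation is our own illustration, not part of the paper's experimental pipeline:

```python
import numpy as np

# If triggered inputs draw predictions uniformly from the label space Y,
# the per-attacker attack success rate approaches 1/|Y| (0.1 for the
# 10-class datasets used here).
rng = np.random.default_rng(0)
num_labels, trials = 10, 100_000
target = rng.integers(num_labels, size=trials)  # attackers' target poison labels
pred = rng.integers(num_labels, size=trials)    # predictions ~ U(Y)
asr = (pred == target).mean()
print(round(asr, 3))  # ≈ 0.1
```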

A.1.6 INSPECTING SUBNETWORK GRADIENTS: CHANGES IN LOSS TERMS

Theorem 2. A model of fixed capacity permits θ with limited subnetworks. The loss optimization condition (Eqt 16) constrains the insertion of subnetwork gradients ϕ to minimize total loss over the joint dataset. To satisfy the ϕ-insertion condition LHS < RHS (16), other than imbalancing the loss terms with a high poison rate (Lemma 3), Eqt 17 shows how the transferability of ε determines whether its subnetwork gradient ϕ is accepted given ε → ϕ. It is empirically demonstrated that |{ε : ϕ}*| ≪ N.

Proof sketch of Theorem 2. Inductively demonstrated with different attack scenarios, we show how the loss function evaluates the insertion of a subnetwork w.r.t. its gradients. Pursuing a loss perspective on this problem is motivated by implications from the transfer-interference tradeoff (Riemer et al., 2019) on feature transferability, by implications from imbalanced gradients (Jiang et al., 2021) on how loss terms can overpower optimization pathways, and by implications of transfer loss as an implicit distance metric.

Single-Agent Backdoor Attack (N=1). We consider the loss minimization procedure at this iteration as an implicit measurement of the entropy of the backdoor subnetwork; if there is a marginal information:capacity benefit from inserting ϕ into θ compared to not inserting it, then the subnetwork gradient is added to θ in this iteration. As θ is already optimized to D \ D_i, we have ∂L(x, y)/∂θ ≈ 0, resulting in Property 9. This update θ → θ* is represented in Eqt 4, where the update condition L(θ ∪ ϕ_backdoor) < L(θ) is defined by Eqt 10, consisting of loss with respect to both clean and poisoned inputs. We denote LHS (10) and RHS (10) as the left-hand side and right-hand side of update condition (10) respectively. The subnetwork would be updated based on update condition (10), where the insertion of the subnetwork would be rejected if LHS > RHS (10).
We refactor into Eqt (11) as an update condition: if LHS > RHS (11), then a subnetwork gradient insertion is rejected.

∂L(x̃, y)/∂θ = ∂L(x, y)/∂θ + ∂L(ε, y)/∂θ ≈ ∂L(ε, y)/∂θ   (9)

L(θ + ϕ_backdoor; X_clean, Y_clean) + L(θ + ϕ_backdoor; X_poison, Y_poison) < L(θ; X_clean, Y_clean) + L(θ; X_poison, Y_poison)   (10)

L(θ + ϕ_backdoor; X_clean, Y_clean) − L(θ; X_clean, Y_clean) < L(θ; X_poison, Y_poison) − L(θ + ϕ_backdoor; X_poison, Y_poison)   (11)

We refactor Eqt 11 into Eqt 14 after decomposing the backdoored inputs into clean and backdoor trigger features. To reiterate, to insert a candidate subnetwork gradient ϕ, the aforementioned conditions 10-14 need to be satisfied. To satisfy these conditions, at least 2 approaches can be taken: (Case 1) maximize the poison and perturbation rate, or (Case 2) jointly minimize the loss with respect to both clean inputs and backdoored inputs after the subnetwork gradient is inserted. From Figure 2 (N = 10^0), we know empirically that this condition can be satisfied for single-agent attacks.

(Case 1) To maximize the poison and perturbation rate of D_0, ε alone, while keeping the loss values constant, we find that the lower bounds required to satisfy conditions 10-14 are |ε|/|x| ≥ 1/2 and |D_0| > 2|D \ D_0| (Lemma 3). Causing an imbalance between the loss function terms is in-line with the analysis of imbalanced gradients (Jiang et al., 2021). Considering that the number of poisoned samples affects the information:capacity ratio, if the exclusion of ϕ_backdoor results in complete misclassification of X_poison, then for the same capacity requirements each backdoor subnetwork has a high information:capacity ratio and it is possible for ϕ_backdoor to be accepted.

(Case 2) Agnostic to substantial poison/perturbation rate increases (Case 1), the attacker can also aim to craft backdoor trigger patterns that share transferable features with clean features (e.g. backdoor trigger patterns generated with PGD (Turner et al., 2019)).
Given that ϕ is crafted such that it minimizes loss w.r.t. backdoored features (i.e. max(RHS − LHS) (11) or max(LHS − RHS) (14)), in order for Eqt 14 to be satisfied, the candidate subnetwork gradient will be accepted if it simultaneously minimizes loss w.r.t. clean features (i.e. min(LHS − RHS) (11) or min(RHS − LHS) (14)). We represent this dual condition as {∂L(x, y)/∂θ < 0 ; ∂L(ε, y)/∂θ < 0}. With sign(x) = +1 for x > 0, −1 for x < 0, condition 15 constrains the gradients to be in the same direction and the loss to decrease, and must be satisfied to accept the candidate subnetwork gradient.

In the multi-agent setting, each individual poison rate p_n is smaller than the sum of all poison rates p, thus the number of backdoored inputs allocated per backdoor subnetwork is smaller. We also presume that any one backdoor poison rate is not greater than the clean dataset, i.e. (1 − p) > p_n ∀n ∈ N. Note that (1 − p) includes not only the defender's clean contribution, but also the clean inputs contributed by each attacker. Given the capacity limitations of a DNN, if the number of attackers N is very large, resulting in many candidate subnetworks, not all of them can be inserted into θ. Given fixed model capacity, θ can only include a limited number of subnetworks, and this number depends on the extent to which each subnetwork carries information that can reduce loss for multiple backdoored sets of inputs (transferability). We can approximate this transferability by studying how the loss changes with respect to this subnetwork and 2 sets of inputs.

Σ_{x,y ∈ D\D_i} L(θ + ϕ_backdoor_n; x_clean, y_clean) + Σ_{i}^{N} Σ_{x,y ∈ D_i} L(θ + ϕ_backdoor_n; x_poison, y_poison) < Σ_{x,y ∈ D\D_i} L(θ; x_clean, y_clean) + Σ_{i}^{N} Σ_{x,y ∈ D_i} L(θ; x_poison, y_poison)   (16)

Compared to the single-agent attack, the information:capacity ratio per backdoor subnetwork is diluted.
We can infer this from Eqt 16 (a multi-agent extension of Eqt 12), where N subnetworks are required to carry information to compute correct predictions for all N backdoored sets, compared to 1 in the single-attacker scenario. The loss optimization procedure (Eqt 16) determines the selection of subnetwork gradients that minimize total loss over the joint dataset. It implicitly determines which backdoored private datasets to ignore with respect to loss optimization, which we reflect in Eqt 17. Given capacity limitations, every combination of backdoor subnetwork gradients is evaluated against every pair of private datasets, to check whether it simultaneously (1) reduces the total loss (Eqt 10), and (2) returns a joint loss reduction with respect to any pair of sub-datasets (Eqt 16). We extend update condition (16) into a subnetwork gradient set optimization procedure (Eqt 17), where loss optimization computes a set of backdoor subnetwork gradients that minimizes the total loss over as many private datasets as possible. To make a backdoor subnetwork more salient with respect to procedure (17), an attacker could (i) increase their individual p_n (Lemma 3), or (ii) have similar/transferable backdoor patterns and target poison labels as other attackers (or any other form of cooperative behavior). We empirically show this in E4. With respect to E3, adversarial perturbations work because they re-use existing subnetworks in θ (i.e. ϕ_clean) without the need to insert a new one. Stylized perturbations can be decomposed into style and content features; the content features may have transferability against unstylized content features, thus there may be no subsequent change to ϕ_clean, though the insertion of a new ϕ_style faces a similar insertion obstacle as ϕ_backdoor. Backdoor subnetworks can have varying distances from each other (e.g. depending on how similar the backdoor trigger patterns and corresponding target poison labels are).
Measuring the distance between subnetworks would be one way of testing whether a subnetwork carries transferable features for multiple private datasets, as, at least in the backdoor setting, each candidate subnetwork tends to be mapped to a specific private dataset. Based on E6, we observe that the parameters diverge per layer as N increases, indicating the low likelihood that, at scale, a large number of random trigger patterns can share common transferable backdoors. In other words, this supports the notion that each subnetwork is relatively unique to each trigger pattern and shares low transferability across a set of private datasets: ||ϕ_i(x) − ϕ_j(x)||_2^2 > 0.

A.1.7 BOUNDS FOR POISON-RATE-DRIVEN SUBNETWORK INSERTION

Lemma 3. To satisfy condition 14 through an increase in poison and perturbation rate alone, assuming the ratio of the loss differences is 1 (i.e. there is a 1:1 tradeoff where the insertion or removal of the subnetwork will cause the same increase/decrease in loss), the resulting lower bounds are |ε|/|x| ≥ 1/2 and |D_0| > 2|D \ D_0|.

Proof. If [L(θ + ϕ; X_backdoor, Y_backdoor) − L(θ; X_backdoor, Y_backdoor)] / [L(θ + ϕ; X_clean, Y_clean) − L(θ; X_clean, Y_clean)] = 1, then

(|ε|/|x|) |D_0| > |D \ D_0| + (|x̃|/|x|) |D_0|
(|ε|/|x|) |D_0| > |D \ D_0| + (1 − |ε|/|x|) |D_0|
(2|ε|/|x| − 1) |D_0| > |D \ D_0|

For the last statement to hold, 2|ε|/|x| − 1 must be positive:

2|ε|/|x| − 1 ≥ 0 ⇒ |ε|/|x| ≥ 1/2

To obtain the minimum poison rate |D_0|, we substitute the minimum perturbation rate |ε|/|x| = 1/2 such that |D_0| > 2|D \ D_0|.

A.1.8 BACKDOOR ATTACK ALGORITHM

BadNet (Gu et al., 2019a). Within the given dimensions (length l × width w × channels c) of an input x ∈ X, a single backdoor trigger pattern m replaces pixel values of x in-place. Indices (l, w, c) specify a specific pixel value in a matrix. m is a mask of identical dimensions to x that contains the perturbed pixel values, while z is its corresponding binary mask, with 1 at the location of a perturbation and 0 everywhere else, i.e.:

z(l, w, c) = 1 if m(l, w, c) > 0; 0 if m(l, w, c) = 0

The trigger pattern can be of any value, as long as it recurringly exists in a poisoned dataset mapped to a poisoned label. Examples include sparse and semantically-irrelevant perturbations (Eykholt et al., 2018; Guo et al., 2019) and low-frequency semantic features (e.g. mask addition of accessories such as sunglasses (Wenger et al., 2021), and low-arching or narrow eyes (Stoica et al., 2017)). The poison rate is the proportion of the private dataset that is backdoored:

p = |X_poison| / (|X_clean| + |X_poison|)

With ⊙ being the element-wise product operator, the BadNet-generated backdoored input is:

x_poison = x ⊙ (1 − z) + m ⊙ z
b : X_poison := {x ⊙ (1 − z) + m ⊙ z}_{x ∈ X_poison}

Random-BadNet.
We implement the baseline backdoor attack algorithm BadNet (Gu et al., 2019b) with the adaptation that, instead of a single square in the corner, we generate randomized pixels such that each attacker has their own specific trigger pattern (avoiding collisions). We verify that these random trigger patterns are functional for single-agent backdoor attacks at N = 1. Many existing backdoor implementations in the literature, including the default BadNet implementation, propose a static trigger, such as a square in the corner of an image input. BadNet only requires a poison rate; we additionally introduce the perturbation rate ε, which determines how much of an image to perturb. Extending on BadNet, m_i is a randomly-generated trigger pattern, sampled per attacker a_i. We make use of the seeded numpy.random.choice and numpy.random.uniform functions from the Python numpy library. The perturbation rate ε_i dictates the likelihood that an index pixel (l, w, c) will be perturbed, and is used to generate the shape mask. The actual perturbation value is randomly sampled. As the perturbation dimensions are not constrained, a higher ε_i results in a higher density of perturbations. We compute the shape mask z_i, the perturbation mask m_i, and consequently the random-trigger-generated backdoored input as follows:

z_i = {numpy.random.choice([0, 1], size = l × w, p = [1 − ε_i, ε_i]).reshape(l, w)} × c
m_i(l, w, c) = numpy.random.uniform(0, 1) × 255 if z(l, w, c) = 1; 0 if z(l, w, c) = 0
x_poison_i = x_i ⊙ (1 − z_i) + m_i ⊙ z_i
b : X_poison_i := {x_i ⊙ (1 − z_i) + m_i ⊙ z_i}_{x_i ∈ X_poison_i}

The distribution of target poison labels may or may not be random. The distribution of clean labels is random, as we randomly sample inputs from the attacker's private dataset to re-assign clean labels to target poison labels. As all our evaluation datasets have 10 classes, this means 1/10 of all backdoored inputs have target poison labels equivalent to their clean labels.
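The Random-BadNet equations above can be sketched in numpy as follows. This is a minimal re-implementation of the stated equations; the function name and HWC array layout are our assumptions:

```python
import numpy as np

def random_badnet_poison(x, eps, seed):
    """Sketch of Random-BadNet: perturb each (l, w) position with
    probability eps (shape mask z, shared across channels), assigning a
    random colour value in [0, 255) (perturbation mask m)."""
    rng = np.random.RandomState(seed)  # per-attacker seed
    l, w, c = x.shape
    z2d = rng.choice([0, 1], size=l * w, p=[1 - eps, eps]).reshape(l, w)
    z = np.repeat(z2d[:, :, None], c, axis=2)        # broadcast mask over channels
    m = rng.uniform(0, 1, size=(l, w, c)) * 255      # random perturbation values
    return x * (1 - z) + m * z                       # x_poison

x = np.zeros((32, 32, 3))                             # dummy CIFAR-sized input
x_poison = random_badnet_poison(x, eps=0.15, seed=1)  # seed = attacker index
```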
We tabulate the raw accuracy w.r.t. poisoned labels; a more reflective attack success rate would be (Acc w.r.t. poisoned labels − 0.1).

Orthogonal-BadNet. We adapt Random-BadNet with orthogonality between the N backdoor trigger patterns. Orthogonal trigger patterns should retain high cosine distances and be far apart from each other in the representation space. We optimize for maximizing cosine distance because we suspect that randomly-generated trigger patterns may in some cases incur feature collisions (Li et al., 2019), where 2 very similar features tend towards 2 very different labels; hence, it may be in the interest of attackers to minimize this occurrence and generate distinctly different trigger patterns that occupy different regions of the representation space. One way of interpreting the intention of minimizing collisions between features (backdoor trigger patterns) is as minimizing interference between these features. Cheung et al. (2019) introduced a method for continual learning in which a set of weights is stored without inducing interference between them during training; they generate a set of orthogonal context vectors that transforms the weights for each task such that each resulting matrix resides in a region of the representation space distant from the others. We adapt a similar implementation, but apply an orthogonal matrix that transforms each backdoor trigger pattern so that it resides in a region distant from the other resultant trigger patterns. First, we generate a base random trigger pattern, the source of information sharing and coordination between the N trigger patterns (unlike Random-BadNet). In-line with Cheung et al.
(2019), where we also use the seeded scipy.stats.ortho_group.rvs function from the Python scipy library (l = w), we sample orthogonal matrices from the Haar distribution and multiply them against the original generated trigger pattern (clipping values to the colour range [0, 255]) to return an orthogonal/distant trigger pattern.

o_i = {scipy.stats.ortho_group.rvs(l)} × c
b : X_poison_i := {x_i ⊙ (1 − z_i ⊙ o_i) + m_i ⊙ z_i ⊙ o_i}_{x_i ∈ X_poison_i}

A.2 EVALUATION DESIGN (EXTENDED)

A.2.1 POISON RATE

The allocation of the joint dataset that each attacker is expected to contribute is assumed to be identical (varying only in the number of backdoored inputs); so the collective attacker allocation is 1 − V_d, and the individual attacker allocation is (1 − V_d) × 1/N. Hence, the real poison rate is calculated as ρ = (1 − V_d) × (1/N) × p. We visualize the allocation breakdown in Figure 4. We acknowledge that a decrease in the number of attackers can result in more of the joint dataset being available for poisoning, and this can result in a larger absolute number of poisoned samples if the poison rate stays constant. To counter this effect, we take into account the maximum number of attackers we wish to evaluate for an experiment, e.g. N = 1000, such that even as N varies, the real poison rate per attacker stays constant.
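The real-poison-rate bookkeeping above reduces to a one-liner. A minimal sketch, where the formula ρ = (1 − V_d) × (1/N) × p is from the text and the helper name is ours:

```python
# Real poison rate: collective attacker allocation (1 - V_d), split
# evenly across N attackers, each poisoning fraction p of their share.
def real_poison_rate(V_d: float, N: int, p: float) -> float:
    """rho = (1 - V_d) * (1 / N) * p"""
    return (1.0 - V_d) * (1.0 / N) * p

# e.g. defender allocation 0.1, 10 attackers, per-attacker poison rate 0.5:
rho = real_poison_rate(V_d=0.1, N=10, p=0.5)  # 0.9 * 0.1 * 0.5 = 0.045
```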

A.2.2 (E1) MULTI-AGENT ATTACK SUCCESS RATE (EXTENDED)

In this case, as we wish to test a large number of attackers N = 1000 with small poison rates p = 0.1 for completeness, we set the defender allocation to be small, V_d = 0.1. This allocation gives sufficient space for 1000 attackers, and we also verify that this extremely low poison rate can still manifest a, albeit weakened, backdoor attack at N = 1. The train-time-run-time split of each attacker is 80-20% (80% of the attacker's private dataset is contributed to the joint dataset, 20% reserved for evaluation at run-time). The train-test split for the defender was 80-20% (80% of the joint dataset used for training, 20% for validation). We trained a ResNet-18 (He et al., 2015) model with batch size 128 and with early stopping when the loss converges (approximately 30 epochs, validation accuracy of 92-93%; loss convergence depends on the pooled dataset structure and number of attackers). We use early stopping over a large number of epochs, as this training scheme is reused and ensures consistent loss convergence given varying training datasets (e.g. training a model on an augmented dataset with backdoor adversarial training, or training a model on stylized perturbations). We use a Stochastic Gradient Descent optimizer with 0.001 learning rate and 0.9 momentum, and a cross-entropy loss function. We set the seed of prerequisite libraries to 3407 for all procedures, except procedures that require each attacker to have distinctly different randomly-sampled values (e.g. trigger pattern generation), in which case the seed value is the index of the attacker (starting from 0).
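For reference, the E1 training configuration stated above, collected into a single record. The dict structure and key names are our own convenience, not the paper's code; the values are from the text:

```python
# E1 training setup as stated above (a plain config record).
E1_CONFIG = {
    "model": "ResNet-18",
    "batch_size": 128,
    "optimizer": {"type": "SGD", "lr": 0.001, "momentum": 0.9},
    "loss": "cross_entropy",
    "stopping": "early stopping on loss convergence (~30 epochs)",
    "global_seed": 3407,           # per-attacker procedures seed on attacker index
    "defender_allocation": 0.1,    # V_d
    "attacker_split": (0.8, 0.2),  # train-time / run-time
    "defender_split": (0.8, 0.2),  # train / validation
}
```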

A.2.3 (E2) GAME VARIATIONS (EXTENDED)

Datasets. We provision 4 datasets, 2 being domain-adapted variants of the other 2. MNIST (10 classes, 60,000 inputs, 1 colour channel) (LeCun & Cortes, 2010) and SVHN (10 classes, 630,420 inputs, 3 colour channels) (Netzer et al., 2011) are a domain pair for digits. CIFAR10 (10 classes, 60,000 inputs, 3 colour channels) (Krizhevsky, 2009) and STL10 (10 classes, 12,000 inputs, 3 colour channels) (Coates et al., 2011) are a domain pair for objects. We make use of the whole dataset instead of the provisioned pre-defined train-test splits, given that we would like to retain custom train-test splits for defenders and we also have an additional run-time evaluation set for attackers (Figure 4).

Capacity

We trained with the same training procedure as E1 (same splits, optimizers, loss functions) with variation of the model architecture: SmallCNN (channels [16, 32, 32]) (Fort et al., 2020), ResNet-{9, 18, 34, 50, 101, 152} (He et al., 2015), Wide ResNet-{50, 101}-2 (Zagoruyko & Komodakis, 2016), and an 11-layer VGG model (with batch normalization) (Simonyan & Zisserman, 2015). Due to computational constraints, we sampled the number of attackers N from the following ranges (1…10, 10…20, 20…100, 100…500, 500…1000), linearly spaced each range into 3 segments, and evaluated on all the returned N. Other than ResNet, we included other architectures such as VGG11 (a model of comparably large capacity to Wide ResNet-101 in terms of number of parameters but with a different architecture) and SmallCNN (a small capacity model).
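The attacker-count sampling described above can be sketched as follows; the exact rounding and deduplication scheme is our assumption:

```python
import numpy as np

# Linearly space each range of attacker counts into 3 segments and
# evaluate every returned N.
ranges = [(1, 10), (10, 20), (20, 100), (100, 500), (500, 1000)]
Ns = sorted({int(n) for lo, hi in ranges for n in np.linspace(lo, hi, 3)})
print(Ns)
```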

A.2.4 (E3) ADDITIONAL SHIFT SOURCES (EXTENDED)

Adversarial perturbations. For adversarial perturbations, we use the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015). With this attack method, an adversarial perturbation rate of ε_a = 0.1 is sufficient to bring the attack success rate down to a level comparable to that of ε_a = 1.0; to induce variance, we scale the perturbation against an upper limit of 1.0 in our experiments, i.e. ε′_a = 0.1 × ε_a. Adversarial perturbations are only introduced during test-time, and each attacker only crafts adversarial perturbations with respect to their own private dataset (i.e. they train their own surrogate models with the same training scheme as the defender, and do not have any access to the joint dataset to craft perturbations). It is also worth noting that FGSM computes perturbations with respect to the gradients of the attacker's surrogate model, where this model was trained on the attacker's private dataset, which contains backdoored inputs mapped to poisoned labels, meaning the feature representation space is perturbed with respect to backdoor trigger patterns. We do not train a surrogate model with respect to the clean private dataset, as the intention of a surrogate model is to approximate the target defender's model, which is assumed to be poisoned; it is also in the attacker's best interest to introduce adversarial perturbations even with respect to the backdoor perturbations, as long as a misclassification occurs (which we can verify with the clean-label accuracy).

Stylized perturbations. For stylistic perturbations, we use the Adaptive Instance Normalization (AdaIN) stylization method (Huang & Belongie, 2017), a standard method used to stylize datasets such as Stylized-ImageNet (Geirhos et al., 2019). Dataset stylization is considered texture shift or domain shift in different strands of the literature. We randomly sample a distinct (non-repeating) style for each attacker.
α is the degree of stylization to apply; 1.0 means 100% stylization, 0.0 means no stylization. We follow the implementation in Huang & Belongie (2017) and Geirhos et al. (2019) and stylize CIFAR-10 with the Painter by Numbers style dataset. We adapt the method for our attack by first randomly sampling a distinct set of styles for each attacker, and stylizing each attacker's sub-dataset before the insertion of backdoor or adversarial perturbations. This shift also contributes to the realistic scenario that different agents may have shifted datasets given heterogeneous sources.
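A minimal per-channel sketch of the AdaIN transfer with the degree-of-stylization α described above. The real pipeline applies AdaIN in VGG feature space through an encoder/decoder; this pixel-space stand-in is only illustrative:

```python
import numpy as np

def adain(content, style, alpha=1.0, eps=1e-5):
    """AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y), blended
    with the original content by alpha (1.0 = full stylization)."""
    mu_c, sd_c = content.mean(axis=(0, 1)), content.std(axis=(0, 1)) + eps
    mu_s, sd_s = style.mean(axis=(0, 1)), style.std(axis=(0, 1)) + eps
    stylized = sd_s * (content - mu_c) / sd_c + mu_s
    return alpha * stylized + (1.0 - alpha) * content

rng = np.random.default_rng(0)
content = rng.random((32, 32, 3))  # HWC arrays as stand-ins for features
style = rng.random((32, 32, 3))
out = adain(content, style, alpha=0.5)
```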

Train-time

When the poison rate is 0.0, the accuracy w.r.t. poisoned labels is equivalent to the accuracy w.r.t. clean labels. In our results, target poisoned labels are the intended labels based on attacker preferences, being clean labels for clean untriggered inputs and poisoned labels for backdoor-triggered inputs. This means that when the run-time poison rate is 0.0, the accuracy w.r.t. poisoned and clean labels are identical (unfiltered values in Figure 9). It also means that when ε, p = 0.0, these values would be identical for accuracy w.r.t. poisoned as well as clean labels. Prior to interpreting the results, we need to consider the attack objective of the attacker. For a backdoor attack, an attacker's objective is to maximize the accuracy w.r.t. poisoned labels. In an adversarial attack, an attacker's objective is to minimize the accuracy w.r.t. clean labels. The attack objectives may have additional conditions in the literature, such as imperceptibility to humans or retaining high accuracy on clean inputs, but the aforementioned 2 are the primary goals. They are not necessarily contradictory, for two reasons: (i) if the poisoned label is non-identical to the clean label, then both a backdoor attack and an adversarial attack will succeed in rendering a misclassification w.r.t. clean labels; and (ii) one similarity between a backdoor attack and an adversarial attack is that they both rely on varying fidelity of information about the train-time distributions, where the backdoor attack has white-box knowledge of the perturbations that will cross the decision boundary to a target class, while the adversarial attack has grey/black-box knowledge of perturbations that may have a likelihood of crossing the decision boundary to a target class. In any case, we conclude for this evaluation that the attack objective of the attacker is to minimize the accuracy w.r.t. clean labels.
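The FGSM perturbation used in E3 takes a single signed gradient step. A self-contained sketch on a toy linear model, where the model, loss, and values are our stand-ins for the surrogate ResNet:

```python
import numpy as np

def fgsm(x, grad_wrt_x, eps):
    """x_adv = x + eps * sign(dL/dx), clipped to a valid input range."""
    return np.clip(x + eps * np.sign(grad_wrt_x), 0.0, 1.0)

# toy surrogate: f(x) = w.x with squared loss L = (f(x) - y)^2
w = np.array([0.5, -0.3, 0.8])
x = np.array([0.2, 0.6, 0.4])
y = 1.0
grad = 2.0 * (w @ x - y) * w     # analytic dL/dx
x_adv = fgsm(x, grad, eps=0.1)   # cf. the scaled eps'_a = 0.1 * eps_a in E3
loss, loss_adv = (w @ x - y) ** 2, (w @ x_adv - y) ** 2
```

Stepping along the sign of the input gradient increases the loss, which is the intended misclassification pressure.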
One of our suspicions regarding a low backdoor attack success rate is whether the generation of adversarial perturbations may de-perturb backdoor perturbations: we visually inspect before-attack and after-attack images to verify that both adversarial and backdoor perturbations are retained in a multi-agent backdoor setting, and we sample a set in Figure 5. With regards to ε_b, p = 1.0 having such a high ASR, this may not just be because of a strong trigger pattern/saliency; it should also be noted that there is 100% class imbalance in this case for the attacker's surrogate model (only 1 class in the training set, hence no decision boundaries to cross; the adversarial ASR should be 0.0). Given that our intention for this experiment is to observe the effect of a joint distribution shift between these two attacks, unmodified in procedure and aligned as closely as possible to their original attack designs, we did not construct a coordinated adversarial-backdoor attack where only adversarial perturbations that do not counter, or even reinforce, the backdoor perturbations / poison labels are crafted.

A.2.5 (E4) COOPERATION OF AGENTS (EXTENDED)

We use Tables 3 and 4 to determine that the approximate Nash Equilibrium is (20.3, 79.7) when a tends to use random triggers and d uses a defense. In this setting, we primarily study 2 variables of cooperation, which are also the set of actions that an attacker can take: (i) input poison parameters (p, ε, and the distance between different attackers' backdoor trigger patterns), and (ii) target poison label selection. In addition to these 2 attacker actions that formulate a set of strategies, we wish to evaluate the robustness of these strategies by (i) testing the scalability of the strategy at very large attacker counts, and (ii) testing the robustness of the strategy by introducing the weakest single-agent backdoor defense.
With these strategies, we would like to observe the following agent-dynamics-driven phenomena, specifically the outcomes from attackers exercising various extents of selfishness (escalating attack parameters) against extents of collective goodwill (coordinating attack parameters): (i) the outcome from the escalation of ε; (ii) the outcome from a gradual coordination of target labels; (iii) the outcome of coordinating trigger pattern generation.

Escalation. Here, we describe how we demonstrate selfish escalation or collective coordination. While in other experiments we scale the effect of random selection to a large number of attackers to approximate non-cooperative behaviour, we would now like to simulate simplified cases of anti-cooperative and cooperative strategies. For the selfish escalation of trigger patterns, we suppose each attacker crafts their backdoor trigger patterns independently from each other, and when information on the presence of other attackers is known, or each attacker wishes to raise the certainty of a backdoor attack, we consider the case where attackers escalate their ε from 0.15 to 0.55 to 0.95. To study trigger label collision cases and coordinated target label selection, we show a gradual change in trigger label selection amongst attackers, where they start off each having independent labels, then there is some trigger label collision between 40% of the attackers (40% of attackers sharing the same label), then there is 100% trigger label collision between all of the attackers (i.e. all attackers share the same label). Coordination can manifest as either attackers each choosing distinctly different labels, or attackers all choosing the same label. To monitor collisions between trigger patterns and trigger labels, we compute the cosine distance between each attacker's trigger pattern and that of Attacker Agent 1.
The trigger patterns are randomly generated: first by computing a random set of pixel positional indices within the image dimensions (the pixel positions to be perturbed), which we refer to as shape and for which we show the corresponding cosine distance; then by computing the colour value change for each pixel position in the shape; we refer to this final trigger pattern of both positions and perturbation values as shape+colour. We assume there is no coordination, hence the choice of random perturbations (as opposed to a perturbation-optimization function minimizing cosine distance), though as ε increases, we note that the cosine distance for both shape and shape+colour decreases (as the density of perturbations would be expected to be higher as ε increases); this provides us with a range of trigger pattern distances in the representation space to evaluate against trigger label selection. In Tables 3 and 4, for agents without the escalation in overlap of target label 4 (in red), we only redistribute the other 4 labels equally, but in Figure 6 we redistribute 10 labels randomly.
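A minimal sketch of this two-stage trigger generation (positions, then colours) and the cosine-distance monitoring, assuming a 32×32×3 input; function names are ours:

```python
import numpy as np

def random_trigger(eps, img_shape=(32, 32, 3), seed=0):
    """Random-BadNet-style trigger (illustrative sketch).

    Stage 1 ("shape"): a binary mask over eps * H * W randomly chosen
    pixel positions. Stage 2 ("shape+colour"): random colour values
    placed at those positions.
    """
    rng = np.random.default_rng(seed)
    h, w, c = img_shape
    n_pix = int(eps * h * w)
    idx = rng.choice(h * w, size=n_pix, replace=False)  # pixel positions
    shape = np.zeros(h * w)
    shape[idx] = 1.0
    colour = np.zeros((h * w, c))
    colour[idx] = rng.uniform(0.0, 1.0, size=(n_pix, c))
    return shape, colour.ravel()

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Distance of one attacker's trigger against Attacker 1's:
s1, sc1 = random_trigger(0.15, seed=1)
s2, sc2 = random_trigger(0.15, seed=2)
d_shape = cosine_distance(s1, s2)
```

Since denser masks overlap more, repeating this at higher ε yields smaller cosine distances, consistent with the trend noted above.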

A.2.6 (E5) PERFORMANCE AGAINST DEFENSES (EXTENDED)

We set the defender allocation to be significantly higher than that of other experiments because some of the defenses require subsets sampled from the defender's private dataset, and enlarging this allows us to test the single-agent defenses leniently for the defenders and harshly for the attackers. This enlargement of the defender allocation also means we should be careful when comparing values between experiments; for example, the real poison rate in this experiment is 0.0011(66), which has 12 more poisoned samples than p = 1.0 in E1, attributed to the large difference in N considered. Extending on stealth and imperceptibility, an important aspect of the backdoor attack, there is a further sub-classification of backdoor attacks into dirty-label (Gu et al., 2019a) and clean-label (Shafahi et al., 2018a; Zhu et al., 2019a) backdoor attacks. In dirty-label backdoor attacks, the true label of triggered inputs does not match the label assigned by the attacker (i.e., the sample would appear incorrectly labeled to a human). In clean-label backdoor attacks, the true label of triggered inputs matches the label assigned by the attacker. The 2 sub-classes can be executed with the same attack algorithm and follow the same underlying principles, with a change to the target trigger label, though variant algorithms for the clean-label setting also exist (Shafahi et al., 2018a; Zhu et al., 2019a).

Clean-Label Backdoor Attack. Hence, in addition to BadNet, we also evaluate defenses against the Clean-Label Backdoor Attack (Turner et al., 2019). Also a common baseline backdoor attack algorithm, the main idea of this method is to perturb the poisoned samples such that learning the salient characteristics of the input becomes more difficult, causing the model to rely more heavily on the backdoor pattern in order to successfully perform label classification.
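A hedged numpy sketch of the clean-label idea: ascend the loss on the *true* label with PGD so the salient class features are harder to learn, then stamp the trigger while keeping the clean label. The linear-softmax surrogate (with analytic gradients) is our stand-in for the actual model, and all names are illustrative:

```python
import numpy as np

def pgd_perturb(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L-inf PGD against a linear-softmax surrogate (analytic gradients
    keep this numpy-only; a stand-in for the real model).

    Ascends the cross-entropy on the *true* label y, so the salient
    class features become harder to learn.
    """
    x_adv = x.copy()
    onehot = np.eye(len(b))[y]
    for _ in range(steps):
        logits = w @ x_adv + b
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = w.T @ (p - onehot)                 # d(cross-entropy)/dx
        x_adv = x_adv + alpha * np.sign(grad)     # gradient-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project to eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv

def clean_label_poison(x, y, w, b, trigger_mask, trigger_val):
    """Perturb with PGD, then stamp the trigger; the label stays clean."""
    x_p = pgd_perturb(x, y, w, b)
    x_p = np.where(trigger_mask, trigger_val, x_p)
    return x_p, y
```

Note that the returned label equals the input's true label, which is precisely what makes the poisoned sample appear benign under human inspection.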
It utilizes adversarial examples or GAN-generated data, such that the resulting poisoned inputs appear consistent with their clean labels and thus seem benign even upon human inspection. The objective of a targeted clean-label poisoning attack (Shafahi et al., 2018a) (which also applies to a clean-label backdoor attack) is to introduce backdoor perturbations during train-time to a set of inputs whose poisoned labels are equal to their original clean labels, such that the usage of the backdoor pattern during run-time, regardless of ground-truth class, returns the poisoned label. We align our implementation with that of Turner et al. (2019), and use projected gradient descent (PGD) (Madry et al., 2018) to insert backdoor perturbations ε. A lower accuracy w.r.t. poisoned labels implies a better defense. While a post-defense accuracy below 0.1 is indicative of mislabelling poisoned samples whose ground-truth clean labels were also poisoned labels, it is at least indicative of reducing the salience of the backdoor trigger perturbation, and hence indicative of backdoor robustness.

Defenses. We evaluate 2 augmentative (data augmentation, backdoor adversarial training) and 2 removal (spectral signatures, activation clustering) defenses. For augmentative defenses, 50% of the defender's allocation of the dataset is assigned to augmentation: for V_d = 0.8, 0.4 is clean, 0.4 is augmented. We devised backdoor defenses based on the backfiring effect, such as agent augmentation and agent indexing, in Datta et al. (2021). To mitigate the low accuracy w.r.t. clean labels in the presence of a backdoor trigger, we devised a backdoor defense through the construction of compressed low-loss subspaces in Datta & Shadbolt (2022a).

• No Defense: We retain identical defender model training conditions to those in E1. The defender allocation of 0.8 is unmodified during model training in this setting.
• Data Augmentation: Recent evidence suggests that using strong data augmentation techniques (Borgnia et al., 2021) (e.g., CutMix (Yun et al., 2019) or MixUp (Zhang et al., 2018)) leads to a reduced backdoor attack success rate. We implement CutMix (Yun et al., 2019), where augmentation takes place per batch, and training completes in accordance with the aforementioned early stopping.

• Backdoor Adversarial Training: Geiping et al. (2021) extend the concept of adversarial training to defender-generated backdoor examples, inserting the defender's own triggers into existing labels. We implement backdoor adversarial training (Geiping et al., 2021), where backdoor perturbations are generated with BadNet (Gu et al., 2019a); 50% of the defender's allocation of the dataset is assigned to backdoor perturbation, p, ε = 0.4, and 20 different backdoor triggers are used (i.e. the allocation of the defender's dataset for each backdoor trigger pattern is (1 − 0.5) × 0.8 × 1/20).

• Spectral Signatures: Spectral Signatures (Tran et al., 2018) is an input inspection method used to perform subset removal from a training dataset. For each class in the backdoored dataset, the method uses the singular value decomposition of the covariance matrix of the learned representations of the inputs in that class to compute an outlier score, and removes the top scores before re-training. In line with existing implementations, we remove the top 5 scores for N = 1 attackers. For N = 100, we scale this value accordingly and remove the top 500 scores.

• Activation Clustering: Activation Clustering (Chen et al., 2018) is also an input inspection method used to perform subset removal from a training dataset. In line with Chen et al.
(2018)'s implementation, we perform dimensionality reduction using Independent Component Analysis (ICA) on the dataset activations, then use k-means clustering to separate the activations into two clusters, and finally use Exclusionary Reclassification to score and assess whether a given cluster corresponds to backdoored data and remove it.
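The Spectral Signatures scoring step described above can be sketched minimally, under the assumption that per-class learned representations are available as a numpy array; `remove_top_scores` is our illustrative wrapper for the top-k removal:

```python
import numpy as np

def spectral_outlier_scores(reps):
    """Spectral Signatures outlier score for one class, minimal sketch.

    reps: (n, d) learned representations. The score is the squared
    projection of each centred representation onto the top singular
    vector (the top principal direction of the covariance).
    """
    centred = reps - reps.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return (centred @ vt[0]) ** 2

def remove_top_scores(reps, k):
    """Drop the k highest-scoring (most outlying) samples, as done
    before re-training."""
    scores = spectral_outlier_scores(reps)
    keep = np.argsort(scores)[:-k]
    return reps[keep]
```

With k = 5 per class for N = 1 (scaled to 500 for N = 100), this mirrors the removal budget used in the experiments.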

A.2.7 (E6) MODEL PARAMETERS INSPECTION (EXTENDED)

Other than ε, p = 0.55 and training SmallCNN, we retain the same attacker/defender configurations as in E1. We adopt SmallCNN for its low parameter count, which helps simplify the subnetwork generation and analysis, as well as overall interpretability. In addition, as we show that the backdoor attack is consistent across all model architectures (E2), using a smaller model means the subnetworks are less diluted and larger in proportion to the complete network (and hence the cosine distance would be less diluted for our observation).

Lottery ticket. We start with a fixed random initialization, shared across the 4 models trained on N = 1, 10, 100, 1000. We use Frankle & Carbin (2019b)'s Iterative Magnitude Pruning (IMP) procedure to generate a pruned DNN (0.8% of the size of the full DNN across all Ns), also denoted as the lottery ticket. The lottery ticket is a subnetwork, specifically the set of nodes in the full DNN that is sufficient for inference at a similar performance to the unpruned DNN. The study of the lottery ticket helps us make inferences with respect to how the feature space changes, as well as where the optima on the loss landscape have deviated. We compute the cosine distance per layer between (i) parameters of the full DNNs, (ii) masks of the lottery tickets against the DNNs, and (iii) parameters of the lottery tickets. We plot these distances against N for the weight and bias matrices of each convolutional layer (conv2d{layer_index}) and fully-connected layer (fc{layer_index}). All in all, the lottery ticket is a proxy for the most salient features, and hence also acts as an alternate feature-space representation. We compute the full-network distance because we wish to decompose the distance changes of the subnetwork across N into both the new optima w.r.t. the change in N and the changes to the DNN w.r.t. the introduction of new trigger patterns.
The mask is a one-zero positional matrix; rather than removing the zeroes, we compute the distance including the zeros to retain the original dimensionality of the full network, and measure the distance factoring in the positions of the values. To retain the positions of parameter values, we multiply the one-zero mask against the full network parameters. While the size (new parameter count) of the pruned network can vary, we set the threshold to stop pruning at 97%, where pruning stops after we maximize accuracy for 97% pruned weights. Specifically, the lottery tickets across all Ns are 0.8% of the size of the full DNN; in other words, we pruned the parameter count from 15,722 to 126. Though we might expect a DNN to store as many subnetworks per trigger as possible if we let the pruning threshold be variable, by setting the capacity to be fixed (in line with our theoretical analysis) we let the optimization steps manifest the loss-function tradeoff discussed earlier, and manifest the acceptance/rejection of backdoor subnetwork insertion under fixed lottery ticket generation.
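The masking and position-aware distance computation can be sketched as follows; `magnitude_mask` is a one-shot stand-in for a full IMP round (which additionally rewinds and retrains between pruning steps), using the 0.8% survival rate from above:

```python
import numpy as np

def magnitude_mask(weights, sparsity=0.992):
    """One-shot magnitude mask, a stand-in for one IMP round: keep the
    largest-magnitude (1 - sparsity) fraction of weights, zero the rest."""
    flat = np.abs(weights).ravel()
    k = int(round((1 - sparsity) * flat.size))
    thresh = np.sort(flat)[-k] if k > 0 else np.inf
    return (np.abs(weights) >= thresh).astype(float)

def masked_cosine_distance(w_a, mask_a, w_b, mask_b):
    """Distance between two lottery tickets. Zeros are retained so the
    positions of surviving weights, not just their values, matter."""
    ta = (w_a * mask_a).ravel()
    tb = (w_b * mask_b).ravel()
    return 1.0 - (ta @ tb) / (np.linalg.norm(ta) * np.linalg.norm(tb))
```

Applied to 15,722 parameters, the default sparsity leaves 126 survivors, matching the counts reported above.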



https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html
https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ortho_group.html



Figure 1: Representation of the multi-agent backdoor attack. Attackers generate unique backdoor trigger patterns with target poison labels and contribute to a joint dataset from which the defender constructs a model.

Attacker's Parameters: Each attacker is a player {a_i}_{i∈N} that generates backdoored inputs X^poison to insert into their private dataset contribution {X^poison ∈ D_i}_{i∈N} ∈ D. Each attacker only has information with respect to their own private dataset source (including inputs, domain/style, class/labels) and backdoor trigger algorithm. Attackers use backdoor attack algorithm b_i (Appendix A.1.8), which accepts a set of inputs mapped to target poisoned labels {X_i : Y_i^poison} ∈ D_i to specify the intended label classification, a backdoor perturbation rate ε_i to specify the proportion of an input to be perturbed, and a poison rate p_i = |X^poison| / (|X^clean| + |X^poison|) to specify the proportion of the private dataset containing backdoored inputs, to return X^poison = b_i(X_i, Y_i^poison, ε_i, p_i).
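A sketch of a single attacker's poisoning routine b_i under the parameters above (BadNet-style random trigger; the function name and defaults are illustrative, not the paper's implementation):

```python
import numpy as np

def badnet_poison(X, y_target, eps=0.15, p=0.15, seed=0):
    """Sketch of one attacker's backdoor routine b_i (BadNet-style).

    eps: proportion of each input perturbed (random trigger);
    p:   proportion of the private dataset that is poisoned.
    Returns the poisoned inputs with their reassigned target labels.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_poison = int(p * n)                      # |X_poison| out of |D_i|
    rows = rng.choice(n, size=n_poison, replace=False)
    cols = rng.choice(d, size=int(eps * d), replace=False)
    X_poison = X[rows].copy()
    X_poison[:, cols] = rng.uniform(0.0, 1.0, size=(n_poison, len(cols)))
    return X_poison, np.full(n_poison, y_target)
```

Each attacker would call this on their own private dataset with their own (eps, p, y_target, seed), yielding N distinct triggers over the joint dataset.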


Figure 3: Performance against Defenses: BadNet and Clean-Label attacks against augmentative and removal defenses.

Naseer et al. (2019); Datta (2021); Qi et al. (2021a) show worsened model performance after applying adversarial perturbations upon domain-shifted inputs. Ganin et al. (2016) proposed a domain-adapted adversarial training scheme to improve domain adaptation performance. Geirhos et al. (2019) also show that the use of stylized perturbations with AdaIN as an augmentation procedure can improve performance on the common-corruptions dataset ImageNet-C. AdvTrojan (Liu et al., 2021) combines adversarial perturbations together with backdoor trigger perturbations to craft stealthy triggers to perform backdoor attacks. Weng et al. (2020) study the trade-off between adversarial defenses optimized towards adversarial perturbations and backdoor defenses optimized towards backdoor perturbations. Santurkar et al. (2020) synthesize distribution shifts by combining random noise, adversarial perturbations, and domain shifts to varying levels to contribute subpopulation shift benchmarks. Rusak et al. (2020) proposed a robustness measure by augmenting a dataset with both adversarial noise and stylized perturbations, evaluating a set of perturbation types including Gaussian noise, stylization and adversarial perturbations.

A.1.4 BACKFIRING EFFECT: CHANGES IN DISTRIBUTION OF ε

be sampled clean and backdoored observations from their respective distributions. Let Rand: s ∼ U(S) s.t. P(s) = 1/|S| denote a random distribution where an observation s is uniformly sampled from a (discrete) set S. If N → ∞, then it follows that the predicted label y* = f(x + ε; θ) ∼ U(Y) s.t. P(y*) = 1/|Y|.

Proof sketch of Theorem 1. With multiple attackers, we sample clean observations x, y ∼

L(θ + ϕ_backdoor; x_clean, y_clean) − L(θ; x_clean, y_clean) < L(θ; x_poison, y_poison) − L(θ + ϕ_backdoor; x_poison, y_poison)    (12)

i.e., inserting the backdoor subnetwork ϕ_backdoor is accepted only if the loss increase it induces on clean observations is smaller than the loss reduction it yields on poisoned observations; (13) and (14) rearrange the same terms.

sign(∂L(x, y; θ + ϕ)/∂θ) + sign(∂L(ε, y; θ + ϕ)/∂θ) ≡ −2    (15)

Multi-Agent Backdoor Attack (N > 1). The addition of each backdoor attacker results in a corresponding subnetwork gradient, formulating θ = Σ_c^|D| ϕ_c + Σ_n^N ϕ_{b_n} for N attackers. The bound in (N=2: Case 1) persists for N > 1; in this analysis, we extend (N=2: Case 2). The cumulative poison rate p is composed of the poison rates of each attacker p =

Figure 4: Summary of defender and attacker allocations in train-time and run-time. Train-time is the period where only training data is processed, including the defender's train set, defender's validation set, and attacker's train-time set. Test-time is the period where only test data is processed, being only attacker run-time set in this setup.

Figure 5: Comparison of images before and after the insertion of adversarial perturbations. Backdoor trigger perturbations still exist, and the adversarial perturbations exist in other regions of the image.

Figure 6: Cooperation of agents (N = 100): (left) Coordination in terms of proportion of attackers selecting poison label 4, while others randomly select; (right) Escalation in terms of proportion of attackers increasing p from 0.15 to 0.55, while others retain 0.15, then subsequently the increase in p from 0.55 to 0.95, while others retain their previously-escalated 0.55; these escalation cases are plotted against corresponding distributions of accuracy w.r.t. poisoned labels (top row) and clean labels (bottom row).

Figure 7: Additional shift sources: Joint distribution shift of varying counts: indexing columns from left to right, column 0 (Multiple {backdoor} perturbations; shifts=1), column 1 (Multiple {backdoor, adversarial} perturbations; shifts=2), column 2 (Multiple {backdoor, stylized} perturbations; shifts=2), column 3 (Multiple {backdoor, adversarial, stylized} perturbations; shifts=3).

Table values are out of 100.0.

(E1) Multi-Agent Attack Success Rate. In this section, we investigate the research question: what effect on attack success rate does the inclusion of an additional attacker make? The base experimental configurations (unless otherwise specified) are listed here and in Appendix A.2. Results are in Figure 2.

(E2) Game variations. In this section, we investigate: do changes in game setup (action-independent variables) manifest different effects in the multi-agent backdoor attack?

and Figure 9.

Capacity variations: Run-time accuracy w.r.t. poisoned labels against N for different models (ratio of number of parameters taken against ResNet-18).

Cooperation of agents: Backdoor trigger patterns generated with Random-BadNet.

Cooperation of agents: Backdoor trigger patterns generated with Orthogonal-BadNet.

Expected values of each strategy from Tables


The payoff functions for anti-cooperative and non-cooperative strategies (i.e. individual ASR) are the same. We evaluate on 5 classes: 0 (airplane), 2 (bird), 4 (deer), 6 (frog), 8 (ship); we retain the same proportions of each of these 5 classes as N varies. For N = 100, we specify the number of attackers N{Y} that target class Y.

