ADVERSARIAL ATTACKS ON ADVERSARIAL BANDITS
Published as a conference paper at ICLR 2023

Abstract

We study a security threat to adversarial multi-armed bandits, in which an attacker perturbs the loss or reward signal to control the behavior of the victim bandit player. We show that the attacker is able to mislead any no-regret adversarial bandit algorithm into selecting a suboptimal target arm in all but a sublinear number of rounds (i.e., in T − o(T) rounds), while incurring only sublinear (o(T)) cumulative attack cost. This result implies a critical security concern in real-world bandit-based systems; e.g., in online recommendation, an attacker might be able to hijack the recommender system and promote a desired product. Our proposed attack algorithms require knowledge of only the regret rate, and thus are agnostic to the concrete bandit algorithm employed by the victim player. We also derive a theoretical lower bound on the cumulative attack cost that any victim-agnostic attack algorithm must incur. The lower bound matches the upper bound achieved by our attack, which shows that our attack is asymptotically optimal.

1. INTRODUCTION

The multi-armed bandit presents a sequential learning framework that enjoys applications in a wide range of real-world domains, including medical treatment Zhou et al. (2019); Kuleshov & Precup (2014), online advertisement Li et al. (2010), resource allocation Feki & Capdevielle (2011); Whittle (1980), search engines Radlinski et al. (2008), etc. In bandit-based applications, the learning agent (bandit player) often receives reward or loss signals generated through real-time interactions with users. For example, in a search engine, the user reward can be clicks, dwelling time, or direct feedback on the displayed website. The user-generated loss or reward signals are then collected by the learner to update the bandit policy. One security caveat of user-generated rewards is that there can be malicious users who generate adversarial reward signals. For instance, in online recommendation, adversarial customers can write fake product reviews to mislead the system into making wrong recommendations. In search engines, cyber-attackers can create click fraud through malware and cause the search engine to display undesired websites. In such cases, the malicious users influence the behavior of the underlying bandit algorithm by generating adversarial reward data. Motivated by this, there has been a surge of interest in understanding potential security issues in multi-armed bandits, i.e., to what extent multi-armed bandit algorithms are susceptible to adversarial user data. Prior works mostly focused on reward attacks in the stochastic multi-armed bandit setting Jun et al. (2018); Liu & Shroff (2019), where the rewards are sampled according to some distribution. In contrast, less is known about the vulnerability of adversarial bandits, a more general bandit framework that relaxes the statistical assumption on the rewards and allows arbitrary (but bounded) reward signals.
Adversarial bandits have also seen applications in a broad class of real-world problems, especially when the reward structure is too complex to model with a distribution, such as inventory control Even-Dar et al. (2009) and shortest path routing Neu et al. (2012). The same security problem could arise in adversarial bandits due to malicious users. Therefore, it is imperative to investigate potential security caveats in adversarial bandits, which provides insights to help design more robust adversarial bandit algorithms and applications. In this paper, we take a step towards studying reward attacks on adversarial multi-armed bandit algorithms. We assume the attacker has the ability to perturb the reward signal, with the goal of misleading the bandit algorithm into always selecting a target (sub-optimal) arm desired by the attacker. Our main contributions are summarized as follows. (1) We present attack algorithms that can successfully force arbitrary no-regret adversarial bandit algorithms into selecting any target arm in T − o(T) rounds while incurring only o(T) cumulative attack cost, where T is the total number of rounds of bandit play. (2) We show that our attack algorithm is theoretically optimal among all possible victim-agnostic attack algorithms, meaning that no other attack algorithm can successfully force T − o(T) target arm selections with a smaller cumulative attack cost while being agnostic to the underlying victim bandit algorithm. (3) We empirically show that our proposed attack algorithms are efficient on both the vanilla Exp3 algorithm and a robust version of it Yang et al. (2020).

2. PRELIMINARIES

The bandit player has a finite action space A = {1, 2, ..., K}, where K is the total number of arms. There is a fixed time horizon T. In each time step t ∈ [T], the player chooses an arm a_t ∈ A, and then receives loss ℓ_t = L_t(a_t) from the environment, where L_t is the loss function at time t. In this paper, we consider "loss" instead of reward, which is more standard in adversarial bandits; however, all of our results also apply in the reward setting. Without loss of generality, we assume the loss functions are bounded: L_t(a) ∈ [0, 1], ∀a, t. Moreover, we consider the so-called non-adaptive environment Slivkins (2019); Bubeck & Cesa-Bianchi (2012), meaning the loss functions L_{1:T} are fixed beforehand and cannot change adaptively based on the player's behavior after the bandit play starts. The goal of the bandit player is to minimize the difference between the cumulative loss incurred by the bandit algorithm and the cumulative loss incurred by always selecting the optimal arm in hindsight, which is defined as the regret below.

Definition 2.1 (Regret). The regret of the bandit player is

    R_T = E[ Σ_{t=1}^T L_t(a_t) ] − min_{a∈A} Σ_{t=1}^T L_t(a),    (1)

where the expectation is with respect to the randomness in the selected arms a_{1:T}.

We now make the following major assumption on the bandit algorithm throughout the paper.

Assumption 2.2 (No-regret Bandit Algorithm). We assume the adversarial bandit algorithm satisfies the "no-regret" property asymptotically, i.e., R_T = O(T^α) for some α ∈ [1/2, 1).

As an example, the classic adversarial bandit algorithm Exp3 achieves α = 1/2. In later sections, we will propose attack algorithms that apply not only to Exp3, but also to arbitrary no-regret bandit algorithms with regret rate α ∈ [1/2, 1). Note that the original loss functions L_{1:T} in (1) could as well be designed by an adversary, which we refer to as the "environmental adversary".
In typical regret analysis of adversarial bandits, it is implicitly assumed that the environmental adversary aims at inducing large regret on the player. To counter the environmental adversary, algorithms like Exp3 introduce randomness into the arm selection policy, which provably guarantees sublinear regret for an arbitrary sequence of adversarial loss functions L_{1:T}.
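To make the setup concrete, below is a minimal sketch of one standard Exp3 variant (exponential weights with importance-weighted loss estimates). This is a generic illustration of a no-regret player, not the exact implementation from the paper's appendix, and the function and parameter names are ours:

```python
import numpy as np

def exp3(T, K, loss_fn, eta, rng=None):
    """Minimal Exp3 sketch: exponential weights over K arms with
    importance-weighted loss estimates; losses assumed in [0, 1]."""
    rng = rng or np.random.default_rng(0)
    weights = np.ones(K)
    arms = []
    for t in range(T):
        pi = weights / weights.sum()      # arm-selection probabilities
        a = rng.choice(K, p=pi)           # sample an arm
        loss = loss_fn(t, a)              # bandit feedback: only l_t(a_t)
        est = np.zeros(K)
        est[a] = loss / pi[a]             # unbiased loss estimate
        weights *= np.exp(-eta * est)     # exponential-weights update
        arms.append(a)
    return arms
```

On a toy instance where arm 0 always has loss 0 and arm 1 always has loss 1, this player quickly concentrates its selections on arm 0, as the no-regret property predicts.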

2.1. MOTIVATION OF ATTACKS ON ADVERSARIAL BANDITS

In many bandit-based applications, an adversary may have an incentive to pursue attack goals other than boosting the regret of the bandit player. For example, in online recommendation, imagine a situation where there are two products, and both produce the maximum click-through rate. We would expect a fair recommender system to treat these two products equally and display them with equal probability. However, the seller of the first product might want to mislead the recommender system into breaking the tie and recommending his product as often as possible, which benefits him most. Note that even if the recommender system chooses to display the first product every time, the click-through rate (i.e., reward) of the system is not compromised, because the first product has the maximum click-through rate by assumption; thus there is no regret in always recommending it. In this case, misleading the bandit player into always selecting a target arm does not boost the regret. We point out that in stochastic bandits, forcing the bandit player to always select a sub-optimal target arm must induce linear regret. Therefore, a robust stochastic bandit algorithm that recovers sublinear regret in the presence of an attacker can prevent a sub-optimal arm from being played frequently. However, in adversarial bandits, the situation is fundamentally different. As illustrated in Example 1, always selecting a sub-optimal target arm may still incur sublinear regret. As a result, robust adversarial bandit algorithms that recover sublinear regret in the presence of an adversary (e.g., Yang et al. (2020)) can still suffer from an attacker who aims at promoting a target arm.

Example 1. Assume there are K = 2 arms, a_1 and a_2, and the loss functions are as below:

    ∀t, L_t(a) = { 1 − √T/T   if a = a_1,
                   1           if a = a_2.    (2)
Note that a_1 is the best-in-hindsight arm, but always selecting a_2 induces only √T regret, which is sublinear and does not contradict the regret guarantee of common bandit algorithms like Exp3.
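A quick numeric check of this example (a hypothetical script of ours, not from the paper):

```python
import math

def example1_regret(T):
    """Regret of always playing a2 under Example 1's losses:
    L_t(a1) = 1 - sqrt(T)/T and L_t(a2) = 1 for all t."""
    loss_a1 = T * (1 - math.sqrt(T) / T)  # best-in-hindsight cumulative loss
    loss_a2 = T * 1.0                     # cumulative loss of always playing a2
    return loss_a2 - loss_a1              # = sqrt(T), sublinear in T
```

For T = 10^6, `example1_regret(10**6)` evaluates to approximately 1000 = √T, tiny compared to the horizon T.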

3. THE ATTACK PROBLEM FORMULATION

While the original loss functions L_{1:T} can already be adversarial, an adversary who desires a target arm often does not have direct control over the environmental loss functions L_{1:T} due to limited power. However, the adversary might be able to perturb the instantiated loss value ℓ_t slightly. For instance, a seller cannot directly control the preference of customers over different products, but he can promote his own product by giving out coupons. To model this attack scenario, we introduce another adversary called the "attacker", an entity who sits between the environment and the bandit player and intervenes in the learning procedure. We now formally define the attacker in detail.

(Attacker Knowledge). We consider an (almost) black-box attacker who has very little knowledge of the task and the victim bandit player. In particular, the attacker does not know the clean environmental loss functions L_{1:T} beforehand. Furthermore, the attacker does not know the concrete bandit algorithm used by the player. However, the attacker knows the regret rate α.[1]

(Attacker Ability). In each time step t, the bandit player selects an arm a_t and the environment generates loss ℓ_t = L_t(a_t). The attacker sees a_t and ℓ_t. Before the player observes the loss, the attacker has the ability to perturb the original loss ℓ_t to ℓ̃_t. The player then observes the perturbed loss ℓ̃_t instead of the original loss ℓ_t. The attacker, however, cannot arbitrarily change the loss value. In particular, the perturbed loss must also be bounded: ℓ̃_t ∈ [0, 1], ∀t.

(Attacker Goal). The goal of the attacker is two-fold. First, the attacker has a desired target arm a†, which can be some sub-optimal arm. The attacker hopes to mislead the player into selecting a† as often as possible, i.e., to maximize N_T(a†) = Σ_{t=1}^T 1[a_t = a†]. On the other hand, every time the attacker perturbs the loss ℓ_t, an attack cost c_t = |ℓ̃_t − ℓ_t| is incurred.
The attacker thus hopes to achieve a small cumulative attack cost over time, defined as below.

Definition 3.1 (Cumulative Attack Cost). The cumulative attack cost of the attacker is defined as

    C_T = Σ_{t=1}^T c_t, where c_t = |ℓ̃_t − ℓ_t|.    (3)

The focus of our paper is to design efficient attack algorithms that can achieve E[N_T(a†)] = T − o(T) and E[C_T] = o(T) while being agnostic to the concrete victim bandit algorithm. Intuitively, if the total loss of the target arm Σ_{t=1}^T L_t(a†) is small, then the attack goals are easy to achieve. In the extreme case, if L_t(a†) = 0, ∀t, then even without attack, a† is already the optimal arm and will be selected frequently in most scenarios.[2] On the other hand, if L_t(a†) = 1, ∀t, then the target arm is always the worst arm, and forcing the bandit player to frequently select a† will require the attacker to significantly reduce L_t(a†). In later sections, we will formalize this intuition and characterize the attack difficulty.
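The interaction protocol above can be sketched as a man-in-the-middle loop. This is a schematic of ours (the callables `select_arm`, `observe`, `env_loss`, and `perturb` are illustrative interfaces, not the paper's notation):

```python
def run_with_attack(T, select_arm, observe, env_loss, perturb, target):
    """Sketch of the Section 3 attack protocol: the attacker sits between
    the environment and the player, perturbing each observed loss."""
    target_pulls, cost = 0, 0.0
    for t in range(1, T + 1):
        a = select_arm()              # player picks arm a_t
        loss = env_loss(t, a)         # clean loss l_t = L_t(a_t)
        tilde = perturb(t, a, loss)   # attacker's perturbed loss
        assert 0.0 <= tilde <= 1.0    # perturbed loss must stay in [0, 1]
        cost += abs(tilde - loss)     # attack cost c_t = |perturbed - clean|
        observe(tilde)                # the player only ever sees the perturbed loss
        target_pulls += (a == target)
    return target_pulls, cost         # N_T(target) and C_T
```

The attacker's two objectives are exactly the two returned quantities: drive `target_pulls` toward T while keeping `cost` sublinear.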

4. ATTACK WITH TEMPLATE LOSS FUNCTIONS

In this section, we first propose a general attack strategy called "template-based attacks". Template-based attacks perform loss perturbations according to a sequence of template loss functions L̃_{1:T}. The templates L̃_{1:T} are determined before the bandit play starts. Then, in each time step t during the bandit play, the attacker perturbs the original loss ℓ_t to ℓ̃_t = L̃_t(a_t). Template-based attacks may seem weak at first glance, because the template loss functions are fixed beforehand and are thus non-adaptive to the behavior of the victim bandit player. This is in stark contrast to most prior works such as Jun et al. (2018). However, as we will show in later sections, template-based attacks are efficient and can even achieve the optimal attack cost. We first make the following important observation, which is a critical property used to prove the main theoretical results in our paper.

Observation 1 (Equivalence of Attack). Due to the partial observability of loss functions in the multi-armed bandit framework, running any bandit algorithm in the original environment L_{1:T} under a template-based attack L̃_{1:T} is equivalent to running the same algorithm in an environment with loss functions L̃_{1:T}. In particular, the standard regret guarantee R_T = O(T^α) holds with respect to the template loss functions L̃_{1:T}.

We next instantiate the template-based attack on an easy attack scenario and on general attack scenarios. For each scenario, we design the corresponding template loss functions L̃_{1:T}.

4.1. AN EASY ATTACK SCENARIO

In this section, we first study a relatively easy attack scenario, which serves as a motivating example that illustrates two key principles behind the design of our attack algorithms: (1) try to keep the loss of the target arm L_t(a†) unchanged; and (2) boost the loss of all other non-target arms to the maximum. The easy attack scenario makes the following assumption on the original loss functions, which requires the loss of the target arm to be at least ρ bounded away from the maximum value.

Assumption 4.1 (Easy Attack Scenario). There exists some constant ρ ∈ (0, 1] such that

    L_t(a†) ∈ [0, 1 − ρ], ∀t ∈ [T].    (4)

The boundedness condition (4) needs to hold over all T rounds. If Assumption 4.1 holds, then the attacker can design the template loss functions L̃_t as in (5) to perform the attack:

    ∀t, L̃_t(a) = { L_t(a)   if a = a†,
                    1         otherwise.    (5)

Remark 4.2. A few remarks are in order. First, note that although the form of L̃_t(a) depends on L_t(a), the attacker does not require knowledge of the original loss functions L_{1:T} beforehand to implement the attack. This is because when a_t = a†, the perturbed loss is ℓ̃_t = L̃_t(a_t) = L_t(a†) = ℓ_t, and ℓ_t is observable. When a_t ≠ a†, ℓ̃_t can be directly set to 1. Second, note that the target arm a† becomes the best-in-hindsight arm after the attack. Consider running a no-regret bandit algorithm on the perturbed losses L̃_{1:T}: since L̃_t(a†) = L_t(a†) ≤ 1 − ρ, every time the player selects a non-target arm a_t ≠ a†, it incurs at least ρ regret. However, the player is guaranteed to achieve sublinear regret on L̃_{1:T} by Observation 1, thus non-target arms can be selected in at most sublinearly many rounds. Finally, note that the loss remains unchanged when the target arm a† is selected. This design is critical because, should the attack be successful, a† will be selected in T − o(T) rounds. By keeping the loss of the target arm unchanged, the attacker incurs no attack cost when the target arm is selected.
As a result, our design (5) induces sublinear cumulative attack cost.

Theorem 4.3. Assume Assumption 4.1 holds, and the attacker applies (5) to perform the attack. Then there exists a constant M > 0 such that the expected number of target arm selections satisfies

    E[N_T(a†)] ≥ T − MT^α/ρ.

The bound suggests that the attack is more effective and efficient if the victim bandit algorithm has a better regret rate. The constant M comes from the regret bound of the victim adversarial bandit algorithm and will depend on the number of arms K (similarly for Theorems 4.6 and 4.9). We do not spell out its concrete form here because our paper aims at designing general attacks against arbitrary adversarial bandit algorithms that satisfy Assumption 2.2. The constant term in the regret bound may take different forms for different algorithms. Comparatively, the sublinear regret rate α is more important for attack considerations.
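The perturbation rule (5) itself is a one-liner. A minimal sketch under our naming (the function name and signature are ours):

```python
def easy_attack(arm, observed_loss, target):
    """Template (5): leave the target arm's loss untouched (zero attack
    cost on target pulls), raise every non-target loss to the max value 1."""
    return observed_loss if arm == target else 1.0
```

Note that the per-round cost is 0 on target pulls and at most 1 otherwise, so the cumulative attack cost is bounded by the (sublinear) number of non-target pulls.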

4.2. GENERAL ATTACK SCENARIOS

Our analysis in the easy attack scenario relies on the fact that every time the player fails to select the target arm a†, at least a constant regret ρ is incurred. Therefore, the player can select non-target arms only sublinearly many times. However, this condition breaks if there exist time steps t where L_t(a†) = 1. In this section, we propose a more generic attack strategy, which provably achieves sublinear cumulative attack cost on any loss functions L_{1:T}. Furthermore, the proposed attack strategy recovers the result of Theorem 4.3 (up to a constant) when applied in the easy attack scenario. Specifically, the attacker designs the template loss functions L̃_t as in (6) to perform the attack:

    ∀t, L̃_t(a) = { min{1 − t^{α+ε−1}, L_t(a)}   if a = a†,
                    1                              otherwise,    (6)

where ε ∈ [0, 1 − α) is a free parameter chosen by the attacker. We discuss how the parameter ε affects the attack performance in Remark 4.7.

Remark 4.5. Similar to (5), the attacker does not require knowledge of the original loss functions L_{1:T} beforehand to implement the attack. When a non-target arm is selected, the attacker always increases the loss to the maximum value 1. On the other hand, when the target arm a† is selected, then if the observed clean loss value ℓ_t = L_t(a_t) > 1 − t^{α+ε−1}, the attacker reduces the loss to 1 − t^{α+ε−1}; otherwise, the attacker keeps the loss unchanged. In doing so, the attacker ensures that the loss of the target arm L̃_t(a†) is at least t^{α+ε−1} smaller than L̃_t(a) for any non-target arm a ≠ a†. As a result, a† becomes the best-in-hindsight arm under L̃_{1:T}. Note that the gap t^{α+ε−1} diminishes as a function of t since ε < 1 − α. The condition that ε must be strictly smaller than 1 − α is important for achieving sublinear attack cost, which we prove later.

Theorem 4.6. Assume the attacker applies (6) to perform the attack.
Then there exists a constant M > 0 such that the expected number of target arm selections satisfies

    E[N_T(a†)] ≥ T − (1/(α+ε)) T^{1−α−ε} − MT^{1−ε},    (7)

and the expected cumulative attack cost satisfies

    E[C_T] ≤ (1/(α+ε)) T^{1−α−ε} + MT^{1−ε} + (1/(α+ε)) T^{α+ε}.    (8)

Remark 4.7. According to (7), the target arm will be selected more frequently as ε grows. This is because the attack (6) enforces that the loss of the target arm L̃_t(a†) is at least t^{α+ε−1} smaller than the loss of non-target arms. As ε increases, the gap becomes larger, thus the bandit algorithm further prefers a†. The cumulative attack cost, however, does not decrease monotonically as a function of ε. This is because while a larger ε results in more frequent target arm selections, the per-round attack cost may also increase. For example, if L_t(a†) = 1, ∀t, then whenever a† is selected, the attacker incurs attack cost t^{α+ε−1}, which grows as ε increases.

Corollary 4.8. Assume the attacker applies (6) to perform the attack. Then when the attacker chooses ε = (1−α)/2, the expected cumulative attack cost achieves the minimum value asymptotically. Correspondingly, we have E[N_T(a†)] = T − O(T^{(1+α)/2}) and E[C_T] = O(T^{(1+α)/2}).

We now show that our attack (6) recovers the results of Theorem 4.3 when it is applied in the easy attack scenario. We first provide another version of the theoretical bounds on E[N_T(a†)] and E[C_T] that depends on how close L_t(a†) is to the maximum value.

Theorem 4.9. Let ρ ∈ (0, 1] be any constant. Define T_ρ = {t | L_t(a†) > 1 − ρ}, i.e., the set of rounds where L_t(a†) is within distance ρ of the maximum loss value. Let |T_ρ| = τ. Also assume that the attacker applies (6) to perform the attack. Then there exists a constant M > 0 such that the expected number of target arm selections satisfies

    E[N_T(a†)] ≥ T − ρ^{1/(α+ε−1)} − τ − MT^α/ρ,

and the cumulative attack cost satisfies

    E[C_T] ≤ ρ^{1/(α+ε−1)} + τ + MT^α/ρ.

Remark 4.10.
In the easy attack scenario, there exists some ρ such that τ = 0; thus, compared to Theorem 4.3, the more generic attack (6) induces an additional constant term ρ^{1/(α+ε−1)} in the bounds on E[N_T(a†)] and E[C_T], which is negligible for large enough T.
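The general template (6) can also be sketched in a few lines (our naming; `alpha` is the victim's regret rate and `eps` the attacker's free parameter with 0 ≤ eps < 1 − alpha):

```python
def general_attack(t, arm, observed_loss, target, alpha, eps):
    """Template (6): cap the target arm's loss at 1 - t**(alpha+eps-1)
    (rounds counted from t = 1), and raise every non-target loss to the
    maximum value 1.  The gap t**(alpha+eps-1) shrinks over time because
    eps < 1 - alpha makes the exponent negative."""
    assert 0.0 <= eps < 1.0 - alpha
    if arm == target:
        return min(1.0 - t ** (alpha + eps - 1.0), observed_loss)
    return 1.0

# Corollary 4.8's choice for alpha = 0.5 is eps = (1 - alpha)/2 = 0.25;
# at t = 16 the cap is 1 - 16**(-0.25) = 1 - 0.5 = 0.5.
print(general_attack(16, 0, 1.0, 0, 0.5, 0.25))  # → 0.5
```

Choosing ε = (1−α)/2 balances the two dominant cost terms in (8), since it makes the exponents 1−ε and α+ε both equal to (1+α)/2.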

5. ATTACK COST LOWER BOUND

We have proposed two attack strategies targeting the easy and general attack scenarios separately. In this section, we show that if an attack algorithm achieves T − o(T) target arm selections and is also victim-agnostic, then the cumulative attack cost is at least Ω(T^α). Note that since we want to derive a victim-agnostic lower bound, it suffices to pick a particular victim bandit algorithm that guarantees O(T^α) regret and then prove that any victim-agnostic attacker must incur at least some attack cost in order to achieve T − o(T) target arm selections. Specifically, we consider the most popular Exp3 algorithm (see Algorithm 1 in the appendix). We first provide the following key lemma, which characterizes a lower bound on the number of arm selections for Exp3.

Lemma 5.1. Assume the bandit player applies the Exp3 algorithm with parameter η (see (34) in the appendix) and initial arm selection probability π_1. Let the loss functions be L_{1:T}. Then ∀a ∈ A, the total number of rounds where a is selected, N_T(a), satisfies

    E[N_T(a)] ≥ Tπ_1(a) − ηT Σ_{t=1}^T E[π_t(a)L_t(a)],

where π_t is the arm selection probability at round t. Furthermore, since π_t(a) ≤ 1, we have

    E[N_T(a)] ≥ Tπ_1(a) − ηT Σ_{t=1}^T L_t(a).    (12)

Remark 5.1. Lemma 5.1 provides two different lower bounds on the number of arm selections based on the loss functions for each arm a. (12) shows that the lower bound on E[N_T(a)] increases as the cumulative loss Σ_{t=1}^T L_t(a) of arm a becomes smaller, which coincides with intuition. In particular, if π_1 is initialized to the uniform distribution and η is picked as cT^{−1/2} for some constant c > 0, the lower bound (12) becomes E[N_T(a)] ≥ T/K − c√T Σ_{t=1}^T L_t(a). One direct conclusion here is that if the loss function of an arm a is always zero, i.e., L_t(a) = 0, ∀t, then arm a must be selected at least T/K times in expectation.
We now provide our main result in Theorem 5.2, which shows that for a particular implementation of Exp3 that achieves O(T^α) regret, any attacker must incur Ω(T^α) cumulative attack cost.

Theorem 5.2. Assume some victim-agnostic attack algorithm achieves E[N_T(a†)] = T − o(T) on all victim bandit algorithms that have regret rate O(T^α), where α ∈ [1/2, 1). Then there exists a bandit task such that the attacker must incur expected attack cost E[C_T] = Ω(T^α) on some victim algorithm. Specifically, one such victim is the Exp3 algorithm with parameter η = Θ(T^{−α}).

The lower bound Ω(T^α) matches the upper bound proved in both Theorem 4.3 and Theorem 4.9 up to a constant, thus our attacks are asymptotically optimal in the easy attack scenario. However, there is a gap compared to the upper bound O(T^{(1+α)/2}) proved for the general attack scenario (Corollary 4.8). The gap diminishes as α approaches 1, but how to completely close it remains an open problem.
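To see why η = Θ(T^{−α}) yields O(T^α) regret, one can plug it into a standard Exp3 regret bound of the form ln(K)/η + ηKT/2 (constants and exact form vary across references; this is an illustrative check of ours, not the paper's exact bound):

```python
import math

def exp3_regret_bound(T, K, eta):
    """One common form of the Exp3 regret bound: ln(K)/eta + eta*K*T/2."""
    return math.log(K) / eta + eta * K * T / 2

# With eta = T**-alpha and alpha >= 1/2, the ln(K)/eta = T**alpha * ln(K)
# term dominates (the other term is K * T**(1-alpha) / 2 = o(T**alpha)),
# so the whole bound scales as T**alpha:
T, K, alpha = 10**6, 2, 0.7
ratio = exp3_regret_bound(T, K, T ** -alpha) / T ** alpha
print(ratio)  # close to ln(2) ~ 0.69
```

Slowing the learning rate to η = Θ(T^{−α}) thus degrades the regret from O(√T) to O(T^α), which is exactly the regime the lower bound targets.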

6. EXPERIMENTS

We now perform empirical evaluations of our attacks. We consider two victim adversarial bandit algorithms: the Exp3 algorithm (see Algorithm 1 in the appendix), and a robust version of Exp3 called ExpRb Yang et al. (2020).

In Figure 2d, we show the cumulative attack costs. For T = 10^6, the cumulative attack cost on the Exp3 victim is 1.14 × 10^5; on average, the per-round attack cost is 0.11. For the ExpRb victim with T = 10^6, the cumulative attack costs are 1.40 × 10^5, 3.62 × 10^5, and 5.15 × 10^5 for the three attacker budgets T^{0.5}, T^{0.7}, and T^{0.9}; the per-round attack costs are 0.14, 0.36, and 0.51 respectively. Note that when the budget is T^{0.5}, ExpRb recovers the regret of Exp3. The per-round attack cost for ExpRb is then 0.14, slightly higher than for Exp3, which again shows that ExpRb is indeed more robust than Exp3. Also note that the attack cost grows as ExpRb assumes a larger attacker budget. This is reasonable, since a larger attacker budget implies stronger robustness of ExpRb.

Most robust algorithms, e.g., Demirel et al. (2022), are designed to recover low regret even in the presence of reward corruptions. However, as we illustrated in Example 1, recovering low regret does not guarantee a successful defense against an attacker who wants to promote a target arm in the adversarial bandit scenario. How to defend against such attacks remains an under-explored question. Of particular interest to our paper is a recent work on designing adversarial bandit algorithms robust to reward corruptions Yang et al. (2020). The paper assumes that the attacker has a prefixed budget Φ of attack cost, and then designs a robust adversarial bandit algorithm, ExpRb, which achieves regret that scales linearly with the attacker budget: R_T = O(√(KT log K) + KΦ log T). As a result, ExpRb can tolerate any attacker with budget Φ = O(√T) while recovering the standard regret rate of Exp3. We point out that one limitation of ExpRb is that it requires prior knowledge of a fixed attack budget Φ.
However, our attack does not have a fixed budget beforehand; instead, the attack cost depends on the behavior of the bandit player. Therefore, ExpRb does not directly apply as a defense against our attack. Nevertheless, in our experiments, we pretend that ExpRb assumes some attack budget and evaluate its performance under our attack.

8. CONCLUSION

We studied reward poisoning attacks on adversarial multi-armed bandit algorithms. We proposed attack strategies for both easy and general attack scenarios, and proved that our attack can successfully mislead any no-regret bandit algorithm into selecting a target arm in T − o(T) rounds while incurring only o(T) cumulative attack cost. We also provided a lower bound on the cumulative attack cost that any victim-agnostic attacker must incur in order to achieve T − o(T) target arm selections, which matches the upper bound achieved by our attack. This shows that our attack is asymptotically optimal. Our study reveals critical security caveats in bandit-based applications, and it remains an open problem how to defend against an attacker whose goal is to promote a desired target arm instead of boosting the regret of the victim bandit algorithm.



[1] It suffices for the attacker to know an upper bound on the regret rate to derive all the results in our paper, but for simplicity we assume the attacker knows the regret rate exactly.

[2] An exceptional case is when there exists some non-target arm a' that also has 0 loss in every round; then a' is equally optimal as a†, and without attack a' would be selected equally often as a†.



Reward attacks on multi-armed bandits mostly fall into the topic of data poisoning Ma et al. (2019b). Prior works are limited to poisoning attacks on stochastic bandit algorithms. One line of work studies reward poisoning on vanilla bandit algorithms like UCB and ε-greedy Jun et al. (2018); Zuo (2020); Niss; Xu et al. (2021b); Ma et al. (2018); Liu & Shroff (2019); Wang et al. (2021); Ma (2021); Xu et al., on contextual and linear bandits Garcelon et al. (2020), and also on best arm identification algorithms Altschuler et al. (2019). Another line focuses on action poisoning attacks Liu & Lai (2020; 2021a), where the attacker perturbs the selected arm instead of the reward signal. Recent studies generalize reward attacks to broader sequential decision-making scenarios such as multi-agent games Ma et al. (2021) and reinforcement learning Ma et al. (2019a); Zhang et al. (2020); Sun et al. (2020); Rakhsha et al. (2021); Xu et al. (2021a); Liu & Lai (2021b), where the problem structure is more complex than in bandits. In multi-agent decision-making scenarios, a related security threat is an internal agent who adopts strategic behaviors to mislead competitors and achieve desired objectives, such as Deng et al. (2019); Gleave et al. (2019). There are also prior works that design robust algorithms in the context of stochastic bandits Feng et al. (2020); Guan et al. (2020); Rangi et al. (2021); Ito (2021), linear and contextual bandits Bogunovic et al. (2021); Ding et al. (2021); Zhao et al. (2021); Yang & Ren (2021); Yang (2021), dueling bandits Agarwal et al. (2021), graphical bandits Lu et al. (2021), best-arm identification Zhong et al. (2021), combinatorial bandits Dong et al. (2022), and multi-agent Vial et al. (2022) or federated bandit learning scenarios Mitra et al. (2021).

APPENDIX

called ExpRb (see Yang et al. (2020)). ExpRb assumes that the attacker has a fixed attack budget. When the budget is O(√T), ExpRb recovers the regret of Exp3. However, our attack does not have a fixed budget beforehand. Nevertheless, we pretend that ExpRb assumes some budget (which may not bound the cumulative attack cost of our attacker) and evaluate its performance for different budget values. Note that, as illustrated in Example 1, robust bandit algorithms that can recover sublinear regret may still suffer from an attacker who aims at promoting a target arm in the adversarial bandit setting.

6.1. AN EASY ATTACK EXAMPLE

In our first example, we consider a bandit problem with K = 2 arms, a_1 and a_2. The loss functions are ∀t, L_t(a_1) = 0.5 and L_t(a_2) = 0. Without attack, a_2 is the best-in-hindsight arm and will be selected most of the time. The attacker, however, aims at forcing arm a_1 to be selected in almost every round; therefore, the target arm is a† = a_1. Note that L_t(a†) = 0.5, ∀t, thus this example falls into the easy attack scenario, and we apply (5) to perform the attack.

Figure 1: Using (5) and (6) to perform attack in an easy attack scenario. (a) T − N_T(a†) of (5). (b) C_T of (5). (c) T − N_T(a†) of (6). (d) C_T of (6).

In the first experiment, we let the total horizon be T = 10^3, 10^4, 10^5, and 10^6. For each T, we run Exp3 and ExpRb under attack for T rounds, and compute the number of "non-target" arm selections T − N_T(a†). We repeat the experiment for 10 trials and take the average. In Figure 1a, we show log(T − N_T(a†)), i.e., the log value of the averaged total number of "non-target" arm selections, as a function of log T. The error bars are tiny, thus we omit them in the plot. A smaller value of log(T − N_T(a†)) means better attack performance. Note that when no attack happens (blue line), the Exp3 algorithm rarely selects the target arm a†. Specifically, for T = 10^6, Exp3 selects a† in 1.45 × 10^4 rounds, which is only 1.5% of the total horizon. Under attack, for T = 10^3, 10^4, 10^5, 10^6, the attacker misleads Exp3 into selecting a† in 8.15 × 10^2, 9.13 × 10^3, 9.63 × 10^4, and 9.85 × 10^5 rounds, which are 81.5%, 91.3%, 96.3%, and 98.5% of the total horizon. We also plot the line y = x for comparison. Note that the slope of log(T − N_T(a†)) is smaller than that of y = x, which means T − N_T(a†) grows sublinearly as T increases. This matches our theoretical results in Theorem 4.3. For the other victim, ExpRb, we consider different levels of attack budget.
The attacker budget assumed by ExpRb must be sublinear, since otherwise ExpRb cannot recover sublinear regret and thus is not practically useful. In particular, we consider budgets T^{0.5}, T^{0.7}, and T^{0.9}. Note that for budgets T^{0.7} and T^{0.9}, ExpRb cannot recover the O(√T) regret of Exp3. For T = 10^6, our attack forces ExpRb to select the target arm in 9.83 × 10^5, 8.97 × 10^5, and 6.32 × 10^5 rounds for the three attacker budgets above. This corresponds to 98.3%, 89.7%, and 63.2% of the total horizon respectively. Note that ExpRb is indeed more robust than Exp3 against our attack; however, our attack still successfully misleads ExpRb into selecting the target a† very frequently. Also note that the attack performance degrades as the attacker budget grows. This is because ExpRb becomes more robust as it assumes a larger attack budget.

Figure 1b shows the attack cost averaged over 10 trials. For Exp3, the cumulative attack costs are 1.85 × 10^2, 8.72 × 10^2, 3.67 × 10^3, and 1.45 × 10^4 for the four different T's. On average, the per-round attack costs are 0.19, 0.09, 0.04, and 0.01 respectively. Note that the per-round attack cost diminishes as T grows. Again, we plot the line y = x for comparison. The slope of log C_T is smaller than that of y = x, which suggests that C_T increases sublinearly as T grows, consistent with our theoretical results in Theorem 4.3. For ExpRb with T = 10^6, our attack incurs cumulative attack costs 1.73 × 10^4, 1.03 × 10^5, and 3.68 × 10^5 when ExpRb assumes budgets T^{0.5}, T^{0.7}, and T^{0.9} respectively. On average, the per-round attack costs are 0.02, 0.10, and 0.37. Note that our attack induces a larger attack cost on ExpRb than on Exp3, which means ExpRb is more resilient against our attacks. Furthermore, the attack cost grows as ExpRb assumes a larger attack budget.
This is again because a larger assumed budget means ExpRb is more prepared against attacks, and thus more robust.

Next, we apply the general attack (6) to verify that it can recover the results of Theorem 4.3 in the easy attack scenario. We fix ε = 0.25 in (6). In Figures 1c and 1d, we show the number of target arm selections and the cumulative attack cost. For the Exp3 victim and the four different T's, the attack (6) forces the target arm to be selected in 8.12 × 10^2, 9.12 × 10^3, 9.63 × 10^4, and 9.85 × 10^5 rounds, which is 81.2%, 91.2%, 96.3% and 98.5% of the total horizon, respectively. Compared to (5), the attack performance is only slightly worse. The corresponding cumulative attack costs are 1.89 × 10^2, 8.76 × 10^2, 3.68 × 10^3, and 1.45 × 10^4; on average, the per-round attack cost is 0.19, 0.09, 0.04 and 0.01. Compared to (5), the attack cost is almost the same.
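The sublinearity claims can be checked directly from the averaged numbers reported above for attack (5) on the Exp3 victim: on a log-log scale, every consecutive slope of both T_NT(a†) and C_T is below 1. A short script (the variable and helper names are ours):

```python
import math

# Averaged numbers reported above for attack (5) on the Exp3 victim.
T_vals = [1e3, 1e4, 1e5, 1e6]
target_pulls = [8.15e2, 9.13e3, 9.63e4, 9.85e5]   # target-arm selections
costs = [1.85e2, 8.72e2, 3.67e3, 1.45e4]          # cumulative attack cost C_T

def loglog_slopes(xs, ys):
    """Slope between consecutive points on a log-log plot."""
    return [math.log(ys[i + 1] / ys[i]) / math.log(xs[i + 1] / xs[i])
            for i in range(len(xs) - 1)]

non_target = [t - s for t, s in zip(T_vals, target_pulls)]  # T_NT(a†)
print(loglog_slopes(T_vals, non_target))  # all below 1: T_NT sublinear in T
print(loglog_slopes(T_vals, costs))       # all below 1: C_T sublinear in T
```

All six slopes come out around 0.6-0.67, well below the slope 1 of the reference line y = x.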

6.2. A GENERAL ATTACK EXAMPLE

In our second example, we consider a bandit problem with K = 2 arms and loss function L_t(a1) = 1 and L_t(a2) = 0 for all t. The attacker desires target arm a† = a1. This example is hard to attack because the target arm has the maximum loss across the entire horizon T. We apply the general attack (6). We consider T = 10^3, 10^4, 10^5, and 10^6. The results reported in this section are also averaged over 10 independent trials.

In the first experiment, we let the victim bandit algorithm be Exp3 and study how the parameter ε affects the performance of the attack. We let ε = 0.1, 0.25 and 0.4. Without attack, Exp3 selects a† in only 1.20 × 10^2, 5.27 × 10^2, 2.12 × 10^3, and 8.07 × 10^3 rounds, which are 12%, 5.3%, 2.1% and 0.81% of the total horizon. In Figure 2a, we show log T_NT(a†) as a function of log T for different ε's. Note that as ε grows, our attack (6) enforces more target arm selections, which is consistent with Theorem 4.6. In particular, for ε = 0.4, our attack forces the target arm to be selected in 8.34 × 10^2, 9.13 × 10^3, 9.58 × 10^4, and 9.81 × 10^5 rounds, which are 83.4%, 91.3%, 95.8% and 98.1% of the total horizon. In Figure 2b, we show the cumulative attack cost. According to Corollary 4.8, the cumulative attack cost achieves its minimum value at ε = 0.25, which is exactly what we see in Figure 2b. Specifically, for ε = 0.25, the cumulative attack costs are 4.20 × 10^2, 2.84 × 10^3, 1.85 × 10^4, and 1.14 × 10^5. On average, the per-round attack cost is 0.42, 0.28, 0.19 and 0.11, respectively; note that the per-round attack cost diminishes as T grows. In both Figures 2a and 2b, we again plot the line y = x for comparison.

In our second experiment, we evaluate the performance of our attack (6) on the robust adversarial bandit algorithm ExpRb Yang et al. (2020). We fix ε = 0.25 in (6).
We consider three levels of attacker budget in ExpRb, T^0.5, T^0.7 and T^0.9, corresponding to increasing power of the attacker. In Figure 2c, we show the total number of target arm selections. For T = 10^6, our attack forces Exp3 to select the target arm in 9.24 × 10^5 rounds, which is 92.4% of the total rounds. For the ExpRb victim, under the three attack budgets, our attack forces ExpRb to select the target arm in 8.97 × 10^5, 6.65 × 10^5 and 5.07 × 10^5 rounds, corresponding to 89.7%, 66.5% and 50.7% of the total horizon, respectively. Note that when the budget is T^0.5, i.e., when ExpRb can recover the regret of Exp3, our attack still forces target arm selection in almost 90% of rounds. This is smaller than the 92.4% on the Exp3 victim, which demonstrates that ExpRb is indeed more robust than Exp3. Nevertheless, ExpRb fails to defend against our attack: even when it assumes a very large attacker budget such as T^0.9, our attack still forces target arm selection in 50.7% of rounds.
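The cost numbers reported for the Exp3 victim in the hard scenario (ε = 0.25) admit the same sanity check as the easy scenario: the log-log slopes of C_T sit around 0.8, higher than in the easy scenario but still below 1, and the per-round cost C_T / T keeps shrinking. A short check (variable names are ours):

```python
import math

# Cumulative attack costs reported above for the hard scenario
# (Exp3 victim, general attack (6) with eps = 0.25).
T_vals = [1e3, 1e4, 1e5, 1e6]
C_T = [4.20e2, 2.84e3, 1.85e4, 1.14e5]

slopes = [math.log(C_T[i + 1] / C_T[i]) / math.log(T_vals[i + 1] / T_vals[i])
          for i in range(3)]
per_round = [c / t for c, t in zip(C_T, T_vals)]
print(slopes)     # roughly 0.83, 0.81, 0.79 -- below 1, so C_T is sublinear
print(per_round)  # roughly 0.42, 0.28, 0.19, 0.11 -- per-round cost diminishes
```

The higher slope relative to the easy scenario reflects that attacking an arm with maximal loss is more expensive, yet the cumulative cost remains sublinear, as Theorem 4.6 predicts.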

