ADVERSARIAL ATTACKS ON ADVERSARIAL BANDITS

Abstract

We study a security threat to adversarial multi-armed bandits, in which an attacker perturbs the loss or reward signal to control the behavior of the victim bandit player. We show that the attacker can mislead any no-regret adversarial bandit algorithm into selecting a suboptimal target arm in all but a sublinear number of rounds (i.e., in T - o(T) rounds), while incurring only sublinear (o(T)) cumulative attack cost. This result implies a critical security concern in real-world bandit-based systems; e.g., in online recommendation, an attacker might be able to hijack the recommender system and promote a desired product. Our proposed attack algorithms require knowledge of only the regret rate, and are thus agnostic to the concrete bandit algorithm employed by the victim player. We also derive a theoretical lower bound on the cumulative attack cost that any victim-agnostic attack algorithm must incur. The lower bound matches the upper bound achieved by our attack, which shows that our attack is asymptotically optimal.

1. INTRODUCTION

The multi-armed bandit is a sequential learning framework that enjoys applications in a wide range of real-world domains, including medical treatment Zhou et al. (2019); Kuleshov & Precup (2014), online advertisement Li et al. (2010), resource allocation Feki & Capdevielle (2011); Whittle (1980), search engines Radlinski et al. (2008), etc. In bandit-based applications, the learning agent (bandit player) often receives reward or loss signals generated through real-time interactions with users. For example, in a search engine, the user reward can be clicks, dwell time, or direct feedback on the displayed website. The user-generated loss or reward signals are then collected by the learner to update the bandit policy.

One security caveat of user-generated rewards is that there can be malicious users who generate adversarial reward signals. For instance, in online recommendation, adversarial customers can write fake product reviews to mislead the system into making wrong recommendations. In search engines, cyber-attackers can create click fraud through malware and cause the search engine to display undesired websites. In such cases, the malicious users influence the behavior of the underlying bandit algorithm by generating adversarial reward data. Motivated by this, there has been a surge of interest in understanding potential security issues in multi-armed bandits, i.e., to what extent multi-armed bandit algorithms are susceptible to adversarial user data. Prior works mostly focused on reward attacks in the stochastic multi-armed bandit setting Jun et al. (2018); Liu & Shroff (2019), where the rewards are sampled according to some distribution. In contrast, less is known about the vulnerability of adversarial bandits, a more general bandit framework that relaxes the statistical assumption on the rewards and allows arbitrary (but bounded) reward signals. Adversarial bandits have also seen applications in a broad class of real-world problems, especially when the reward structure is too complex to model with a distribution, such as inventory control Even-Dar et al. (2009) and shortest-path routing Neu et al. (2012). The same security problem could arise in adversarial bandits due to malicious users. It is therefore imperative to investigate potential security caveats in adversarial bandits, which provides insights that help design more robust adversarial bandit algorithms and applications.

In this paper, we take a step towards studying reward attacks on adversarial multi-armed bandit algorithms. We assume the attacker has the ability to perturb the reward signal, with the goal of misleading the bandit algorithm into always selecting a target (sub-optimal) arm desired by the attacker (see the schematic sketch below). Our main contributions are summarized as follows. (1) We present attack algorithms that can force arbitrary no-regret adversarial bandit algorithms into selecting any target arm in T - o(T) rounds while incurring only o(T) cumulative attack cost, where T is the total number of rounds of bandit play. (2) We show that our attack algorithm is theoretically optimal among all victim-agnostic attack algorithms: no attack algorithm that is agnostic to the underlying victim bandit algorithm can force T - o(T) target-arm selections with a smaller cumulative attack cost than ours. (3) We empirically show that our proposed attack algorithms are effective against both the vanilla Exp3 algorithm and a robust version of it Yang et al. (2020).
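To fix ideas, the following sketch illustrates the interaction protocol under attack. It is only schematic: `player`, `attacker`, and the perturbation rule `perturb` are hypothetical placeholders used for illustration, not the concrete algorithms studied in this paper, whose attack strategy is developed in later sections.

```python
# Schematic of the threat model (hypothetical `player`/`attacker` interfaces).
def run_attacked_bandit(player, attacker, L, T, target_arm):
    """L[t][a] is the original (unperturbed) loss L_t(a) in [0, 1]."""
    attack_cost = 0.0
    target_pulls = 0
    for t in range(T):
        a = player.select_arm()                      # victim chooses an arm
        perturbed = attacker.perturb(t, a, L[t][a])  # attacker corrupts the feedback
        attack_cost += abs(perturbed - L[t][a])      # per-round attack cost
        player.observe(a, perturbed)                 # victim only sees the perturbed loss
        target_pulls += int(a == target_arm)
    # A successful attack achieves target_pulls = T - o(T) while attack_cost = o(T).
    return target_pulls, attack_cost
```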

2. PRELIMINARIES

The bandit player has a finite action space A = {1, 2, ..., K}, where K is the total number of arms. There is a fixed time horizon T. In each time step t ∈ [T], the player chooses an arm a_t ∈ A and then receives the loss ℓ_t = L_t(a_t) from the environment, where L_t is the loss function at time t. In this paper, we consider "loss" instead of reward, which is more standard in adversarial bandits; all of our results also apply in the reward setting. Without loss of generality, we assume the loss functions are bounded: L_t(a) ∈ [0, 1] for all a, t. Moreover, we consider the so-called non-adaptive environment Slivkins (2019); Bubeck & Cesa-Bianchi (2012), which means the loss functions L_1:T are fixed beforehand and cannot change adaptively based on the player's behavior after the bandit play starts. The goal of the bandit player is to minimize the difference between the cumulative loss incurred by the bandit algorithm and the cumulative loss of always selecting the best arm in hindsight, which is defined as the regret below.

Definition 2.1 (Regret). The regret of the bandit player is
$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} L_t(a_t)\right] - \min_{a \in A} \sum_{t=1}^{T} L_t(a), \qquad (1)$$
where the expectation is with respect to the randomness in the selected arms a_1:T.

We now make the following major assumption on the bandit algorithm throughout the paper.

Assumption 2.2 (No-regret Bandit Algorithm). We assume the adversarial bandit algorithm satisfies the "no-regret" property asymptotically, i.e., R_T = O(T^α) for some α ∈ [1/2, 1). (We restrict to α ≥ 1/2 because prior works Auer et al. (1995); Gerchinovitz & Lattimore (2016) have proved that the regret has lower bound Ω(√T).)

As an example, the classic adversarial bandit algorithm Exp3 achieves α = 1/2; a minimal implementation is sketched below. In later sections, we will propose attack algorithms that apply not only to Exp3, but to arbitrary no-regret bandit algorithms with regret rate α ∈ [1/2, 1).

Note that the original loss functions L_1:T in (1) could themselves be designed by an adversary, which we refer to as the "environmental adversary". In typical regret analyses of adversarial bandits, it is implicitly assumed that the environmental adversary aims at inducing large regret on the player. To counter the environmental adversary, algorithms like Exp3 introduce randomness into the arm-selection policy, which provably guarantees sublinear regret for an arbitrary sequence of adversarial loss functions L_1:T.
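For concreteness, here is a minimal sketch of the loss-based Exp3 update referenced above, assuming numpy. The learning rate η = √(log K / (KT)) and the numerical-stability shift are standard textbook choices, not details taken from this paper.

```python
import numpy as np

def exp3(loss_fn, K, T, seed=0):
    """Minimal loss-based Exp3; loss_fn(t, a) returns L_t(a) in [0, 1]."""
    rng = np.random.default_rng(seed)
    eta = np.sqrt(np.log(K) / (K * T))  # fixed learning rate for a known horizon T
    cum_est = np.zeros(K)               # importance-weighted cumulative loss estimates
    arms = []
    for t in range(T):
        # Exponential weights over estimated losses; shift by the min for stability.
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        a = rng.choice(K, p=p)
        loss = loss_fn(t, a)            # bandit feedback: only L_t(a_t) is observed
        cum_est[a] += loss / p[a]       # unbiased estimate of the chosen arm's loss
        arms.append(a)
    return arms
```

The randomization in p is the mechanism by which Exp3 guards against a fixed adversarial loss sequence: the importance-weighted estimates keep the update unbiased even though only the chosen arm's loss is revealed.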

2.1. MOTIVATION OF ATTACKS ON ADVERSARIAL BANDITS

In many bandit-based applications, an adversary may have an incentive to pursue attack goals other than boosting the regret of the bandit player. For example, in online recommendation, imagine that there are two products, and both produce the maximum click-through rate. We would expect a fair recommender system to treat these two products equally and display them with equal probability. However, the seller of the first product might want to mislead the recommender system into breaking the tie and recommending his product as often as possible, which benefits him most. Note that even if the recommender system chooses to display the first product every time, the click-through rate (i.e., reward) of the system is not compromised, because the first product attains the maximum click-through rate by assumption; thus there is no regret in always recommending it. In this case, misleading the bandit player to always select a target arm does not boost the regret. We point out that, in contrast, in stochastic bandits, forcing the bandit player to always select a sub-optimal target arm must induce linear regret, since the sub-optimality gap of the target arm is fixed.
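To make the tie-breaking example concrete in the loss notation of Section 2 (maximum click-through rate corresponds to minimum loss), suppose arms 1 and 2 are both per-round optimal. Then the policy that always plays arm 1 incurs zero regret:
$$L_t(1) = L_t(2) = \min_{a \in A} L_t(a) \ \text{ for all } t \quad \Longrightarrow \quad R_T = \sum_{t=1}^{T} L_t(1) - \min_{a \in A} \sum_{t=1}^{T} L_t(a) = 0,$$
since $\sum_{t=1}^{T} L_t(1) = \sum_{t=1}^{T} \min_{a} L_t(a) \le \min_{a} \sum_{t=1}^{T} L_t(a) \le \sum_{t=1}^{T} L_t(1)$.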




