ADVERSARIAL ATTACKS ON ADVERSARIAL BANDITS

Abstract

We study a security threat to adversarial multi-armed bandits, in which an attacker perturbs the loss or reward signal to control the behavior of the victim bandit player. We show that the attacker is able to mislead any no-regret adversarial bandit algorithm into selecting a suboptimal target arm in every but sublinear (T o(T )) number of rounds, while incurring only sublinear (o(T )) cumulative attack cost. This result implies critical security concern in real-world bandit-based systems, e.g., in online recommendation, an attacker might be able to hijack the recommender system and promote a desired product. Our proposed attack algorithms require knowledge of only the regret rate, thus are agnostic to the concrete bandit algorithm employed by the victim player. We also derived a theoretical lower bound on the cumulative attack cost that any victim-agnostic attack algorithm must incur. The lower bound matches the upper bound achieved by our attack, which shows that our attack is asymptotically optimal.

1. INTRODUCTION

Multi-armed bandit presents a sequential learning framework that enjoys applications in a wide range of real-world domains, including medical treatment Zhou et al. (2019) ; Kuleshov & Precup (2014) , online advertisement Li et al. (2010 ), resource allocation Feki & Capdevielle (2011); Whittle (1980) , search engines Radlinski et al. (2008) , etc. In bandit-based applications, the learning agent (bandit player) often receives reward or loss signals generated through real-time interactions with users. For example, in search engine, the user reward can be clicks, dwelling time, or direct feedbacks on the displayed website. The user-generated loss or reward signals will then be collected by the learner to update the bandit policy. One security caveat in user-generated rewards is that there can be malicious users who generate adversarial reward signals. For instance, in online recommendation, adversarial customers can write fake product reviews to mislead the system into making wrong recommendations. In search engine, cyber-attackers can create click fraud through malware and causes the search engine to display undesired websites. In such cases, the malicious users influence the behavior of the underlying bandit algorithm by generating adversarial reward data. Motivated by that, there has been a surge of interest in understanding potential security issues in multi-armed bandits, i.e., to what extend are multi-armed bandit algorithms susceptible to adversarial user data. Prior works mostly focused on studying reward attacks in the stochastic multi-armed bandit setting Jun et al. ( 2018); Liu & Shroff (2019) , where the rewards are sampled according to some distribution. In contrast, less is known about the vulnerability of adversarial bandits, a more general bandit framework that relaxes the statistical assumption on the rewards and allows arbitrary (but bounded) reward signals. The adversarial bandit also has seen applications in a broad class of real-world problems especially when the reward structure is too complex to model with a distribution, such as inventory control Even-Dar et al. (2009) and shortest path routing Neu et al. (2012) . Similarly, the same security problem could arise in adversarial bandits due to malicious users. Therefore, it is imperative to investigate potential security caveats in adversarial bandits, which provides insights to help design more robust adversarial bandit algorithms and applications. In this paper, we take a step towards studying reward attacks on adversarial multi-armed bandit algorithms. We assume the attacker has the ability to perturb the reward signal, with the goal of misleading the bandit algorithm to always select a target (sub-optimal) arm desired by the attacker. Our ⇤ This work does not relate to the author's position at Amazon, no matter how the author affiliates.

