ONLINE LEARNING UNDER ADVERSARIAL CORRUPTIONS

Abstract

We study the design of efficient online learning algorithms tolerant to adversarially corrupted rewards. In particular, we study settings where an online algorithm makes a prediction at each time step, and receives a stochastic reward from the environment that can be arbitrarily corrupted with probability ε ∈ [0, 1/2). Here ε is the noise rate that characterizes the strength of the adversary. As is standard in online learning, we study the design of algorithms with small regret over a period of T time steps. However, while the algorithm observes corrupted rewards, we require its regret to be small with respect to the true uncorrupted reward distribution. We build upon recent advances in robust estimation for unsupervised learning problems to design robust online algorithms with near-optimal regret in three different scenarios: stochastic multi-armed bandits, linear contextual bandits, and Markov Decision Processes (MDPs) with stochastic rewards and transitions. Finally, we provide empirical evidence regarding the robustness of our proposed algorithms on synthetic and real datasets.

1. INTRODUCTION

The study of online learning algorithms has a rich and extensive history (Slivkins, 2019). An online learning algorithm makes a sequence of predictions, one per time step, and receives a reward. The predictions could involve picking an expert from a given set of experts, or picking an action from a set of available actions as in reinforcement learning settings. The goal of the algorithm is to maximize the long-term reward resulting from the sequence of predictions made. The performance of such an algorithm is measured in terms of the regret, i.e., the difference between the total reward accumulated by the algorithm and the total reward accumulated by the best expert/action/policy in hindsight. Various online learning models have been studied in the literature depending on whether the rewards are generated i.i.d. from some distribution (Gittins, 1979; Thompson, 1933) or are arbitrary (Auer et al., 2001), and whether the rewards of all actions are observed at each time step (full information setting) (Littlestone & Warmuth, 1994) vs. only the reward of the chosen action (bandit setting) (Auer et al., 2002; 2001).

In this work, we initiate the study of online learning algorithms with adversarial reward corruptions. Specifically, we focus on the case where the generated rewards are infrequently masked by an adversary and replaced with potentially unbounded corruptions. In doing so, we develop safeguards that minimize the impact of such corruptions on control algorithms operating in an online setting. For example, consider a reinforcement learning agent that interacts with a real-world environment to learn a near-optimal policy mapping states to actions. For a given state-action pair (s, a), while the true reward distribution may be stochastic, the observed reward associated with (s, a) will have inherent errors due to real-world constraints.
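To make the regret notion concrete, the following minimal sketch compares a naive baseline against the best arm in hindsight. The arm means, horizon, and uniform-random policy are illustrative placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: three arms with fixed mean rewards and
# unit-variance Gaussian noise (numbers are placeholders).
true_means = np.array([0.2, 0.5, 0.8])
T = 1000

# A naive baseline that pulls arms uniformly at random.
pulls = rng.integers(0, len(true_means), size=T)
rewards = rng.normal(true_means[pulls], 1.0)

# Regret: total reward of the best single arm in hindsight minus the
# total reward the algorithm actually accumulated.
regret = true_means.max() * T - rewards.sum()
print(f"regret of the uniform policy over {T} steps: {regret:.1f}")
```

A good bandit algorithm drives the per-step regret toward zero; the uniform policy's regret instead grows linearly in T.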
We would still like to have online learning algorithms that are robust to these errors and have small regret when compared to the true reward distribution. These considerations are important in many applications such as routing, dynamic pricing, autonomous driving and algorithmic trading. For example, a classic problem in routing involves choosing the best route in the presence of noise in the ETA estimation for any given route. Similarly, dynamic pricing algorithms need to be robust to adversarial spikes in demand that may lead to unwanted price surges.

To formally study the above scenarios, we consider an online algorithm that proceeds sequentially, where in each step it makes a prediction and receives a reward. With probability ε the observed reward is adversarially corrupted. More specifically, we take inspiration from Huber's contamination model, which has been successfully applied to study various robust estimation problems in unsupervised learning (Huber, 2011; Tukey, 1975; Chen et al., 2018; Lai et al., 2016; Diakonikolas et al., 2019a). Assuming that P is the true distribution of the rewards, in our model the reward at each step is generated from (1 − ε)P + εQ, where Q is an arbitrary distribution. Here ε < 1/2 is the noise rate. Under this model we design online algorithms with near-optimal regret, scaling with ε, for three important cases: 1) multi-armed stochastic bandits, 2) linear contextual bandits, and 3) learning in finite state MDPs with stochastic rewards.

Overview of results. We first consider the setting of stochastic multi-armed bandits. In this scenario there are k arms numbered 1, 2, . . . , k. In the standard multi-armed bandit model, at each time step the algorithm can pull arm i and get a real-valued reward r_i generated from a normal distribution with mean µ_i and variance σ².¹ We let i* denote the best arm, that is, µ_{i*} ≥ µ_i, ∀i.
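The contamination model is easy to simulate. The sketch below draws each reward from the mixture (1 − ε)P + εQ, with P a Gaussian and Q a point mass on a huge value standing in for the adversary (the model allows Q to be arbitrary and unbounded); the function name and all constants are illustrative, not from the paper:

```python
import numpy as np

def corrupted_reward(mu, sigma, eps, rng):
    """Sample once from the Huber-style mixture
    (1 - eps) * N(mu, sigma^2) + eps * Q.

    Q is chosen by the adversary; here it is a huge constant purely for
    illustration -- the model places no bound on the corruptions.
    """
    if rng.random() < eps:
        return 1e6  # adversarial, potentially unbounded corruption
    return rng.normal(mu, sigma)

rng = np.random.default_rng(1)
samples = [corrupted_reward(0.5, 1.0, 0.1, rng) for _ in range(10_000)]

# The empirical mean is destroyed by the corruptions, while a robust
# statistic such as the median stays near the true mean of 0.5.
print(f"mean:   {np.mean(samples):.1f}")
print(f"median: {np.median(samples):.2f}")
```

This contrast between the mean and the median is the basic reason robust estimators, rather than naive empirical averages, are needed throughout.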
In the ε-corrupted model we assume that the reward for pulling arm i at time t is r_i^t ∼ (1 − ε)N(µ_i, σ²) + εQ_t, where Q_t is an arbitrary distribution chosen by an adversary. We assume that the adversary has complete knowledge of the sequence of predictions and rewards up to time t − 1, as well as the true mean rewards and any internal state of the algorithm. Over T time steps, the pseudo-regret of an algorithm A that pulls arms (i_1, i_2, . . . , i_T) is defined as

Reg_A = µ_{i*} · T − E[Σ_{t=1}^T r_{i_t}].   (1)

Notice that while the adversary masks the true rewards with the corrupted ones, we still measure the overall performance with respect to the true reward distribution. While this setting has been studied in recent works (Lykouris et al., 2018; Gupta et al., 2019; Kapoor et al., 2019), they either assume corruptions of bounded magnitude, or provide sub-optimal performance guarantees (see the discussion in Section A). In particular, we provide the following near-optimal regret guarantee based on a robust implementation of the UCB algorithm (Auer et al., 2002).

Theorem 1 (Informal Theorem). For the adversarially corrupted stochastic multi-armed bandit problem, there is an efficient robust online algorithm that achieves a pseudo-regret bounded by Õ(σ√(kT)) + O(σεT).

The first term in the above bound is the optimal worst-case regret bound achievable for the standard stochastic multi-armed bandit setting (Auer et al., 2002). The second term denotes the additional regret incurred due to the corruptions. Furthermore, the work of Kapoor et al. (2019) showed that an additional σεT penalty is unavoidable in the worst case, thereby making the above guarantee optimal up to a constant factor. Furthermore, as in the case of stochastic bandits with no corruptions, we can also obtain instance-wise guarantees where the first term above depends logarithmically on T and on an instance-dependent quantity that captures how far off the arms are as compared to the best one.
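To convey the flavor of a robust UCB variant, here is a hypothetical sketch in which the empirical mean inside UCB is replaced by a trimmed mean, a standard robust estimator. This is purely illustrative and need not match the estimator or confidence bounds of the actual algorithm; all names and constants are assumptions:

```python
import numpy as np

def trimmed_mean(x, eps):
    """Mean after discarding roughly a 2*eps fraction of the largest and
    smallest samples (the factor of 2 leaves slack for the random number
    of corruptions). Falls back to the median on tiny samples."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.ceil(2 * eps * len(x)))
    if len(x) <= 2 * k:
        return float(np.median(x))
    return float(x[k:len(x) - k].mean())

def robust_ucb(means, sigma, eps, T, rng):
    """UCB with a robust mean estimate on eps-corrupted rewards
    (an illustrative sketch, not the paper's exact algorithm)."""
    k = len(means)
    history = [[] for _ in range(k)]
    pulled = []
    for t in range(T):
        if t < k:
            arm = t  # initialization: pull each arm once
        else:
            ucb = [trimmed_mean(history[i], eps)
                   + sigma * np.sqrt(2 * np.log(T) / len(history[i]))
                   for i in range(k)]
            arm = int(np.argmax(ucb))
        # Observed reward follows (1 - eps)*N(mu_arm, sigma^2) + eps*Q_t,
        # with Q_t taken to be a huge constant for illustration.
        r = 1e6 if rng.random() < eps else rng.normal(means[arm], sigma)
        history[arm].append(r)
        pulled.append(arm)
    return pulled

rng = np.random.default_rng(2)
pulled = robust_ucb([0.1, 0.4, 0.9], sigma=0.5, eps=0.05, T=2000, rng=rng)
print(f"fraction of pulls on the best arm: {pulled.count(2) / len(pulled):.2f}")
```

A plain empirical mean would let a single unbounded corruption permanently inflate an arm's index; the trimmed estimate discards such outliers once an arm has been sampled a few times.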
See Appendix B for details.

Next we consider the case of contextual stochastic bandits. We study adversarially corrupted linear contextual bandits. In the standard setting of linear contextual bandits (Li et al., 2010), there are k arms with k associated (unknown) mean vectors µ*_1, . . . , µ*_k ∈ R^d. At time t, the online algorithm sees k context vectors x^t_1, . . . , x^t_k ∈ R^d, one per arm. If the algorithm pulls arm i then the reward is generated from the distribution N(µ*_i · x^t_i, σ²). In the corrupted setting, we allow the rewards at certain time steps to be corrupted by an adversary. In particular, we assume that the context vectors are drawn i.i.d. from N(0, I). Given the true context x^t_i, as in the stochastic bandits setting we let the observed reward be generated from r^t_i ∼ (1 − ε)N(µ*_i · x^t_i, σ²) + εQ_t, where Q_t is an arbitrary distribution. Given a sequence of arm pulls (i_1, . . . , i_T) we define the pseudo-regret of an algorithm as Reg_A = Σ_{t=1}^T E[max_i µ*_i · x^t_i] − E[Σ_{t=1}^T r_{i_t}]. In the above definition, the expectation is taken over the distribution of contexts, the stochastic rewards and the internal randomness of the algorithm. For this case we provide the following near-optimal regret guarantee.

¹ For simplicity we assume that the variance for each arm distribution is the same. Our results can also be easily extended to handle different variances, and also the more standard setting where the true rewards are bounded in [0, 1].
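Under the stated assumptions (contexts drawn i.i.d. from N(0, I), one unknown mean vector per arm), the corrupted reward process and the contextual pseudo-regret can be sketched as follows. The dimensions, the uniform policy, and the choice of Q_t are placeholders for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, T, eps, sigma = 5, 3, 1000, 0.1, 1.0

# Unknown per-arm mean vectors mu*_1, ..., mu*_k (illustrative values).
mu = rng.normal(size=(k, d))

pseudo_regret = 0.0
for t in range(T):
    contexts = rng.normal(size=(k, d))                # x^t_i ~ N(0, I), one per arm
    true_means = np.einsum('ij,ij->i', mu, contexts)  # mu*_i . x^t_i for each arm
    arm = int(rng.integers(k))                        # naive uniform policy
    # Observed reward: (1 - eps)*N(mu*_arm . x^t_arm, sigma^2) + eps*Q_t.
    observed = 1e6 if rng.random() < eps else rng.normal(true_means[arm], sigma)
    # Pseudo-regret is measured against the TRUE (uncorrupted) expected
    # rewards, even though the algorithm only ever sees `observed`.
    pseudo_regret += true_means.max() - true_means[arm]

print(f"pseudo-regret of the uniform policy: {pseudo_regret:.1f}")
```

Note the asymmetry at the heart of the model: the learner's feedback passes through the corruption, but its performance is scored against the clean linear rewards.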