ONLINE LEARNING UNDER ADVERSARIAL CORRUPTIONS

Abstract

We study the design of efficient online learning algorithms that are tolerant to adversarially corrupted rewards. In particular, we study settings where an online algorithm makes a prediction at each time step and receives a stochastic reward from the environment that can be arbitrarily corrupted with probability η ∈ [0, 1/2). Here η is the noise rate that characterizes the strength of the adversary. As is standard in online learning, we study the design of algorithms with small regret over a period of T time steps. However, while the algorithm observes corrupted rewards, we require its regret to be small with respect to the true uncorrupted reward distribution. We build upon recent advances in robust estimation for unsupervised learning problems to design robust online algorithms with near-optimal regret in three different scenarios: stochastic multi-armed bandits, linear contextual bandits, and Markov Decision Processes (MDPs) with stochastic rewards and transitions. Finally, we provide empirical evidence regarding the robustness of our proposed algorithms on synthetic and real datasets.

1. INTRODUCTION

The study of online learning algorithms has a rich and extensive history (Slivkins, 2019). An online learning algorithm makes a sequence of predictions, one per time step, and receives a reward. The predictions could involve picking an expert from a given set of experts, or picking an action from a set of available actions as in reinforcement learning settings. The goal of the algorithm is to maximize the long-term reward resulting from the sequence of predictions made. The performance of such an algorithm is measured in terms of the regret, i.e., the difference between the total reward accumulated by the algorithm and the total reward accumulated by the best expert/action/policy in hindsight. Various online learning models have been studied in the literature depending on whether the rewards are generated i.i.d. from some distribution (Gittins, 1979; Thompson, 1933) or are arbitrary (Auer et al., 2001), and whether the reward for all actions is observed at each time step (full information setting) (Littlestone & Warmuth, 1994) vs. observing only the reward of the chosen action (bandit setting) (Auer et al., 2002; 2001). In this work, we initiate the study of online learning algorithms with adversarial reward corruptions. Specifically, we focus on the case where the generated rewards are infrequently masked by an adversary and replaced with potentially unbounded corruptions. In doing so, we develop safeguards that minimize the impact of such corruptions on control algorithms operating in an online setting. For example, consider a reinforcement learning agent that interacts with the real world environment to learn a near optimal policy mapping states to actions. For a given state-action pair (s, a), while the true reward distribution may be stochastic, the observed reward associated with (s, a) will have inherent errors due to real-world constraints.
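The corruption model and the regret criterion described above can be illustrated with a small simulation. The snippet below is a minimal sketch, not code from the paper: it assumes a hypothetical 3-armed stochastic bandit with Gaussian rewards, an illustrative corruption rate η = 0.1 with a single large corrupted value, and a deliberately naive greedy learner. It shows how regret is measured against the true (uncorrupted) reward means even though the learner only ever sees possibly corrupted rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed stochastic bandit; the means are unknown to the learner.
true_means = np.array([0.3, 0.5, 0.7])
eta = 0.1        # corruption probability (illustrative choice)
T = 10_000

def observe_reward(arm):
    """True reward is Gaussian around the arm's mean; with probability eta
    the adversary replaces it with an arbitrary (here, very large) value."""
    r = rng.normal(true_means[arm], 0.1)
    if rng.random() < eta:
        r = 100.0  # corruptions may be unbounded
    return r

# A naive follow-the-empirical-mean learner (deliberately non-robust).
pulls = np.zeros(3)
sums = np.zeros(3)
for a in range(3):                      # pull each arm once to initialize
    sums[a] += observe_reward(a)
    pulls[a] += 1
for _ in range(T):
    arm = int(np.argmax(sums / pulls))  # greedy on (corrupted) empirical means
    sums[arm] += observe_reward(arm)
    pulls[arm] += 1

# Regret is measured against the *true* reward distribution, not the
# corrupted observations: (T + 3) plays of the best arm vs. what was earned.
regret = (T + 3) * true_means.max() - pulls @ true_means
```

Because the corruptions inflate the empirical means of whichever arms happen to be corrupted early, this non-robust learner can lock onto a suboptimal arm; robust estimation of the per-arm means is what prevents this.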
We would still like to have online learning algorithms that are robust to these errors and have small regret when compared to the true reward distribution. These considerations are important in many applications such as routing, dynamic pricing, autonomous driving, and algorithmic trading. For example, a classic problem in routing involves choosing the best route in the presence of noise in the ETA estimation for any given route. Similarly, dynamic pricing algorithms need to be robust to adversarial spikes in demand that may lead to unwanted price surges. To formally study the above scenarios, we consider an online algorithm that proceeds sequentially, where in each step it makes a prediction and receives a reward. With probability η the observed reward is adversarially corrupted. More specifically, we take inspiration from Huber's contamination

