NEAR-OPTIMAL ADVERSARIAL REINFORCEMENT LEARNING WITH SWITCHING COSTS

Abstract

Switching costs, which capture the costs of changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient that is strictly positive and independent of the time horizon) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process; thus, practical scenarios where the loss distribution could be non-stationary or even adversarial are not considered. While adversarial RL better models such practical scenarios, an open problem remains: how can one develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower bound showing that the regret of any algorithm must be at least Ω((HSA)^{1/3} T^{2/3}), where T, S, A and H are the numbers of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge posed by switching costs in adversarial RL, the best regret achieved in static RL with switching costs (as well as in adversarial RL without switching costs), whose dependency on T is Õ(√T), is no longer attainable. Moreover, we propose two novel switching-reduced algorithms whose regrets match our lower bound when the transition function is known, and match our lower bound up to a small factor of Õ(H^{1/3}) when the transition function is unknown. Our regret analysis demonstrates their near-optimal performance.

1. INTRODUCTION

Reinforcement learning (RL) has recently emerged as a compelling paradigm for modeling machine learning applications with sequential decision making. In such a problem, an online learner interacts with the environment sequentially over Markov decision processes (MDPs), and aims to find a desirable policy with respect to the accumulated loss (or reward). Various algorithms have been developed for RL problems and have been shown theoretically to achieve polynomial sample efficiency in Zimin & Neu (2013); Azar et al. (2017); Jin et al. (2018); Agarwal et al. (2019); Bai et al. (2019); Jin et al. (2020a;b); Cai et al. (2020); Gao et al. (2021); Lykouris et al. (2021); Qiao et al. (2022), etc.

In addition to the metric of losses, switching costs, which capture the costs of changing policies during the execution of RL algorithms, are also attracting increasing attention. This is motivated by many practical scenarios where online learners cannot change their policies for free. For example, in recommendation systems, each change of the recommendation involves the processing of a huge amount of data and additional computational costs (Theocharous et al., 2015). Similarly, in healthcare, each change of the medical treatment requires substantial human effort and time-consuming tests and trials (Yu et al., 2021). Switching costs also need to be considered in many other areas, e.g., robotics applications (Kober et al., 2013), education software (Bennane, 2013), computer networking (Xu et al., 2018), and database optimization (Krishnan et al., 2018).

Switching costs have been studied in various problems (please see Sec. 2 for some examples). Among these studies, a relevant line of research is bandit learning (Geulen et al., 2010; Dekel et al., 2014; Arora et al., 2019; Shi et al., 2022). Recently, switching costs have received consider-

