NEAR-OPTIMAL ADVERSARIAL REINFORCEMENT LEARNING WITH SWITCHING COSTS

Abstract

Switching costs, which capture the costs of changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient that is strictly positive and independent of the time horizon) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process; practical scenarios where the loss distribution could be non-stationary or even adversarial are thus not considered. While adversarial RL better models such practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower bound showing that the regret of any algorithm must be larger than Ω((HSA)^{1/3} T^{2/3}), where T, S, A and H are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best regret achieved in static RL with switching costs (as well as in adversarial RL without switching costs), whose dependency on T is Õ(√T), is no longer achievable. Moreover, we propose two novel switching-reduced algorithms whose regrets match our lower bound when the transition function is known, and match our lower bound within a small factor of Õ(H^{1/3}) when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of both algorithms.

1. INTRODUCTION

Reinforcement learning (RL) has recently emerged as a compelling paradigm for modeling machine learning applications with sequential decision making. In such a problem, an online learner interacts with the environment sequentially over Markov decision processes (MDPs), and aims to find a desirable policy that minimizes the accumulated loss (or maximizes the accumulated reward). Various algorithms have been developed for RL problems and have been shown theoretically to achieve polynomial sample efficiency (Zimin & Neu, 2013; Azar et al., 2017; Jin et al., 2018; Agarwal et al., 2019; Bai et al., 2019; Jin et al., 2020a;b; Cai et al., 2020; Gao et al., 2021; Lykouris et al., 2021; Qiao et al., 2022).

In addition to the metric of losses, switching costs, which capture the costs of changing policies during the execution of RL algorithms, are attracting increasing attention. This is motivated by many practical scenarios where online learners cannot change their policies for free. For example, in recommendation systems, each change of the recommendation involves the processing of a huge amount of data and additional computational costs (Theocharous et al., 2015). Similarly, in healthcare, each change of the medical treatment requires substantial human effort and time-consuming tests and trials (Yu et al., 2021). Switching costs also need to be considered in many other areas, e.g., robotics applications (Kober et al., 2013), education software (Bennane, 2013), computer networking (Xu et al., 2018), and database optimization (Krishnan et al., 2018).

Switching costs have been studied in various problems (please see Sec. 2 for some examples). Among these studies, a relevant line of research is on bandit learning (Geulen et al., 2010; Dekel et al., 2014; Arora et al., 2019; Shi et al., 2022). Recently, switching costs have received considerable attention in more general RL settings (Bai et al., 2019; Gao et al., 2021; Wang et al., 2021; Qiao et al., 2022). However, these studies have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process.
Thus, practical scenarios where the loss distribution could be non-stationary or even adversarial are not characterized or considered. While adversarial RL better models non-stationary or adversarial changes of the loss distribution, to the best of our knowledge, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs?

Intuitively, in adversarial RL, since many more policy switches would be needed to adapt to the time-varying environment, it would be much more difficult to achieve a low regret (including both the standard loss regret and the switching costs; please see (6)). Indeed, without a special design to reduce switching, existing algorithms for adversarial RL with T episodes, such as those in Zimin & Neu (2013); Jin et al. (2020a); Lee et al. (2020) and Lykouris et al. (2021), could yield poor performance with a number of policy switches that is linear in T. Thus, the goal of this paper is to make the first effort along this open direction.

Our first aim is to develop provably efficient algorithms that enjoy low regrets in adversarial RL with switching costs. This requires a careful reduction of switching under non-stationary or adversarial loss distributions. It turns out that previous approaches to reducing switching in static RL (e.g., those in Bai et al. (2019) and Qiao et al. (2022)) are not applicable here. Specifically, the high-level idea in static RL is to switch frequently at the beginning, and less and less frequently in later episodes. Such a method performs well in static RL, mainly because after learning enough information about losses at the beginning (by switching frequently), the learner can estimate the assumed fixed loss distribution accurately enough with high probability in later episodes. Thus, even though the learner switches less and less frequently, a low regret is still achievable with high probability. In contrast, when the loss distribution could change arbitrarily, this method does not work.
This is mainly because what the learner learned in the past may not be useful for the future. For example, when the loss distribution is adversarial, a state-action pair with small losses in the past may incur large losses in the future. Thus, new ideas are required for addressing switching costs in adversarial RL.

Our second aim is to understand fundamentally whether the new challenge of switching costs in adversarial RL significantly increases the regret. This requires a converse result, i.e., a lower bound on the regret, that holds for any RL algorithm. Further, we aim to understand fundamentally whether the adversarial nature of RL indeed requires many more policy switches to achieve a low loss regret.

Our contributions: In this paper, we achieve the aforementioned goals and make the following three main contributions. (We use Ω, Θ and Õ to hide constants and logarithmic factors.)

First, we provide a lower bound (in Theorem 1) showing that, for adversarial RL with switching costs, the regret of any algorithm must be larger than Ω((HSA)^{1/3} T^{2/3}), where T, S, A and H are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best regret achieved in static RL with switching costs, whose dependency on T is Õ(√T), is no longer achievable. Further, we characterize precisely the new trade-off (in Theorem 2) between the standard loss regret and the switching costs due to the adversarial nature of RL.

Second, we develop the first-known near-optimal algorithms for adversarial RL with switching costs. As discussed above, the idea for reducing switching in static RL does not work well here. To handle losses that can change arbitrarily, our design is inspired by the approach in Shi et al. (2022) for bandit learning, but with two novel ideas.
(a) We delay each switch by a fixed (but tunable) number of episodes, which ensures that a switch occurs only every Õ(T^{1/3}) episodes. (b) The idea in (a) results in consistently long intervals without switching. Since the bias in estimating losses from such a long interval tends to increase the regret, it is important to construct an unbiased estimate of the losses for each interval. To achieve this, the idea in bandit learning is to treat all time-slots in each interval as one time-slot, which necessarily requires a single chosen action in each interval. Such an approach is not applicable to our more general MDP setting, since there is no guarantee of visiting a single state-action pair due to state transitions. To resolve this issue, our novel idea is to decompose each interval, and then combine the losses of each state-action pair only from the episodes in which that state-action pair is visited. Interestingly, although this combination is random and the loss is adversarial, the estimated losses are (almost) unbiased in expectation.
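The two ideas above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name, the data layout, and the plain per-pair averaging rule are assumptions made for exposition, standing in for the paper's combination step (which is designed so that the resulting estimates are almost unbiased). It shows the interval decomposition of idea (a), with interval length of order T^{1/3}, and the per-(state, action) loss combination of idea (b), where each pair's losses are aggregated only over the episodes of the interval in which that pair was actually visited.

```python
import math

def interval_loss_estimates(T, episodes, S, A):
    """Sketch of interval-based loss estimation under a reduced-switching schedule.

    `episodes` is a list of per-episode observations, each a dict mapping a
    visited (state, action) pair to the loss observed there.  The interval
    length tau ~ T^{1/3} mirrors idea (a): the policy may change only at
    interval boundaries, so there are at most ceil(T / tau) ~ T^{2/3} switches.
    Within each interval, idea (b) combines the losses of each (s, a) pair
    only from the episodes in which that pair was visited.
    """
    tau = max(1, math.ceil(T ** (1 / 3)))  # episodes per no-switch interval
    estimates = []  # one dict of per-(s, a) loss estimates per interval
    for start in range(0, len(episodes), tau):
        interval = episodes[start:start + tau]
        est = {}
        for s in range(S):
            for a in range(A):
                visited = [ep[(s, a)] for ep in interval if (s, a) in ep]
                # average only over episodes where (s, a) was visited;
                # pairs never visited in this interval default to 0.0
                est[(s, a)] = sum(visited) / len(visited) if visited else 0.0
        estimates.append(est)
    return estimates
```

For instance, with T = 8 the interval length is tau = 2, so four episodes decompose into two intervals, and a pair visited in both episodes of an interval gets the average of its two observed losses.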





