NEAR-OPTIMAL REGRET BOUNDS FOR MODEL-FREE RL IN NON-STATIONARY EPISODIC MDPS

Abstract

We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes (MDPs). Both the reward functions and the state transition distributions are allowed to vary over time, either gradually or abruptly, as long as their cumulative variation magnitude does not exceed certain budgets. We propose an algorithm named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB) for this setting, which adopts a simple restarting strategy and an extra optimism term. Our algorithm outperforms the state-of-the-art (model-based) solution in terms of dynamic regret. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret of Õ(S^{1/3} A^{1/3} ∆^{1/3} H T^{2/3}), where S and A are the numbers of states and actions, respectively, ∆ > 0 is the variation budget, H is the number of steps per episode, and T is the total number of steps. We further show that our algorithm is near-optimal by establishing an information-theoretic lower bound of Ω(S^{1/3} A^{1/3} ∆^{1/3} H^{2/3} T^{2/3}), which to the best of our knowledge is the first impossibility result for non-stationary RL in general.

1. INTRODUCTION

Reinforcement learning (RL) studies the class of problems where an agent maximizes its cumulative reward through sequential interaction with an unknown but fixed environment, usually modeled as a Markov decision process (MDP). At each time step, the agent takes an action, receives a random reward drawn from a reward function, and the environment then transitions to a new state according to an unknown transition kernel. In classical RL problems, the transition kernel and the reward functions are assumed to be time-invariant. This stationary model, however, cannot capture the fact that in many real-world decision-making problems, the environment, including both the transition dynamics and the reward functions, is inherently evolving over time. Non-stationarity arises in a wide range of applications, including online advertisement auctions (Cai et al., 2017; Lu et al., 2019), dynamic pricing (Board, 2008; Chawla et al., 2016), traffic management (Chen et al., 2020), healthcare operations (Shortreed et al., 2011), and inventory control (Agrawal & Jia, 2019).

Among these applications, we specifically emphasize two research areas that can significantly benefit from progress on non-stationary RL, yet whose connections to it have been largely overlooked in the literature. The first is sequential transfer in RL (Tirinzoni et al., 2020), or multi-task RL (Brunskill & Li, 2013). In this setting, the agent encounters a sequence of tasks over time with different system dynamics and reward functions, and seeks to bootstrap learning by transferring knowledge from previously-solved tasks. The second is multi-agent reinforcement learning (MARL) (Littman, 1994), where a set of agents collaborate or compete in a shared environment. In MARL, since the transition and reward functions of the agents are coupled, the environment is non-stationary from each agent's own perspective, especially when the agents learn and update their policies simultaneously. A more detailed discussion of how non-stationary RL can benefit sequential transfer, multi-task, and multi-agent RL is given in Appendix A.

Learning in a non-stationary MDP is highly non-trivial due to the following challenges. The first is the exploration vs. exploitation trade-off inherited from standard (stationary) RL: an agent needs to explore the uncertain environment efficiently while maximizing its rewards along the way. Classical solutions in stationary RL often leverage the "optimism in the face of uncertainty" principle, which adopts an upper confidence bound to guide exploration. This bound can be either an optimistic estimate of the state transition distributions in model-based solutions (Jaksch et al., 2010), or an optimistic estimate of the Q-values in model-free ones (Jin et al., 2018; Zhang et al., 2020).
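To make the model-free instantiation of this principle concrete, the following is a minimal sketch of an optimistic Q-learning update with a Hoeffding-type bonus, in the spirit of Jin et al. (2018). It is an illustration rather than the exact algorithm of any cited work; in particular, the constant c and the log factor iota are assumptions.

```python
import numpy as np

def optimistic_q_update(Q, h, s, a, r, v_next, n, H, c=1.0, iota=1.0):
    """One optimistic Q-learning update for state s, action a at step h.

    Q      : array of shape (H, S, A), initialized optimistically to H
    r      : observed reward in [0, 1]
    v_next : current value estimate of the next state at step h + 1
    n      : number of visits to (h, s, a) so far, including this one
    c, iota: illustrative bonus constant and log factor (assumptions)
    """
    alpha = (H + 1) / (H + n)               # learning rate used in optimistic Q-learning
    bonus = c * np.sqrt(H ** 3 * iota / n)  # Hoeffding-type exploration bonus
    target = r + v_next + bonus             # optimistic one-step target
    Q[h, s, a] = min((1 - alpha) * Q[h, s, a] + alpha * target, H)
    return Q[h, s, a]
```

Because the estimate is kept above the optimal value with high probability, acting greedily with respect to Q automatically drives exploration toward under-visited state-action pairs.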

Table 1: Dynamic regret comparisons for RL in non-stationary MDPs. S and A are the numbers of states and actions, L is the number of abrupt changes, D is the maximum diameter, H is the number of steps per episode, and T is the total number of steps. Gray cells denote results from this paper.

An additional challenge in non-stationary RL is the trade-off between remembering and forgetting. Since the system dynamics vary from one episode to another, all the information collected from previous interactions is essentially out-of-date and biased. In fact, it has been shown that a standard RL algorithm may incur linear regret if the non-stationarity is not handled properly (Ortner et al., 2019). On the other hand, the agent does need to retain a sufficient amount of information from history for future decision making, and learning what to remember becomes a further challenge.

In this paper, we introduce an algorithm, named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), to address the aforementioned challenges in non-stationary RL. Our algorithm utilizes an extra optimism term for exploration, in addition to the standard Hoeffding/Bernstein-based bonus in the upper confidence bound, to counteract the non-stationarity of the MDP. This additional bonus term guarantees that our optimistic Q-value remains an upper bound of the optimal Q*-value even when the environment changes. To address the second challenge, we adopt a simple but effective restarting strategy that resets the memory of the agent according to a calculated schedule. Similar strategies have also been considered in non-stationary bandits (Besbes et al., 2014) and non-stationary RL in the undiscounted setting (Jaksch et al., 2010; Ortner et al., 2019). The restarting strategy ensures that our algorithm only refers to the most up-to-date experience for decision making. A further advantage of our algorithm is that RestartQ-UCB is model-free. Compared with model-based solutions, our model-free algorithm is more time- and space-efficient, more flexible to use, and more compatible with the design of modern deep RL architectures.
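As an illustration of how the restarting schedule and the extra optimism term fit together, the sketch below wipes all counters and Q-estimates at the start of every epoch and adds a non-stationarity bonus (extra_bonus) on top of the standard one. The equal-length epoch schedule and the form of extra_bonus are illustrative assumptions; the paper derives the exact epoch length and bonus from the variation budget ∆.

```python
import numpy as np

class RestartedOptimisticQ:
    """Sketch of optimistic Q-estimates that are wiped at every restart."""

    def __init__(self, S, A, H):
        self.S, self.A, self.H = S, A, H
        self.reset()

    def reset(self):
        # Forget all experience collected before the current epoch.
        self.Q = np.full((self.H, self.S, self.A), float(self.H))
        self.N = np.zeros((self.H, self.S, self.A), dtype=int)

    def update(self, h, s, a, r, v_next, c=1.0, extra_bonus=0.0):
        self.N[h, s, a] += 1
        n = self.N[h, s, a]
        alpha = (self.H + 1) / (self.H + n)
        # Standard optimism plus an extra term (assumed form) that keeps the
        # estimate optimistic even when rewards/transitions drift within the epoch.
        bonus = c * np.sqrt(self.H ** 3 / n) + extra_bonus
        self.Q[h, s, a] = min((1 - alpha) * self.Q[h, s, a]
                              + alpha * (r + v_next + bonus), self.H)


def restart_points(T, num_epochs):
    """Equal-length epoch schedule: time steps at which the agent restarts."""
    epoch_len = int(np.ceil(T / num_epochs))
    return [e * epoch_len for e in range(num_epochs)]
```

In an actual run, the learner would call reset() at every step index returned by restart_points and otherwise act greedily with respect to the current optimistic Q, so that decisions within an epoch rely only on data gathered after the most recent restart.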

