ON CONVERGENCE OF AVERAGE-REWARD OFF-POLICY CONTROL ALGORITHMS IN WEAKLY-COMMUNICATING MDPS

Abstract

We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton, 2021a) and RVI Q-learning (Abounadi, Bertsekas, & Borkar, 2001), converge in weakly-communicating MDPs. Weakly-communicating MDPs are the most general class of MDPs for which a learning algorithm with a single stream of experience can be guaranteed to obtain a policy achieving the optimal reward rate. The original convergence proofs of the two algorithms require that all optimal policies induce unichains, which is not necessarily true in weakly-communicating MDPs. To the best of our knowledge, our results are the first to show that average-reward off-policy control algorithms converge in weakly-communicating MDPs. As a direct extension, we show that the average-reward options algorithms introduced by Wan, Naik, and Sutton (2021b) converge if the semi-MDP induced by the options is weakly-communicating.

1. INTRODUCTION

Modern reinforcement learning algorithms are designed to achieve the agent's goal in either the episodic setting or the continuing setting. In both settings, an agent continually interacts with its world, which is usually assumed to be a Markov Decision Process (MDP). In episodic problems, there is a special terminal state and a set of start states; when the agent reaches the terminal state, it is reset to one of the start states. Continuing problems differ in that there is no terminal state, and the agent is never reset by the world. For continuing problems, two commonly considered objectives are the discounted objective and the average-reward objective. The discount factor in the discounted objective has been argued to be problematic in the control setting with function approximation, suggesting that the average-reward objective may be more suitable for continuing problems. In this paper, we consider off-policy control algorithms for the average-reward objective. These algorithms learn a policy that achieves the best possible reward rate, using data generated by some other policy that the agent may not control. Designing convergent off-policy algorithms for the average-reward objective is challenging. While there are several off-policy learning algorithms in the literature, the only known convergent algorithms are SSP Q-learning and RVI Q-learning, both by Abounadi, Bertsekas, & Borkar (2001), the algorithm by Ren & Krogh (2001), and Differential Q-learning by Wan, Naik, & Sutton (2021a). Others either do not have convergence proofs (Schwartz 1993; Singh 1994; Bertsekas & Tsitsiklis 1996; Das 1999) or have incorrect proofs (Yang 2016; Gosavi 2004).¹ The algorithm by Ren & Krogh (2001) requires knowledge of properties of the MDP that are not typically known. The convergence of SSP Q-learning is limited to MDPs in which some state is recurrent under all policies.
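To make the two convergent algorithms studied in this paper concrete, the following is a minimal sketch of their tabular update rules on a small illustrative example. The two-state, two-action deterministic MDP, the step sizes, the uniform-random behavior policy, and the choice of reference function for RVI Q-learning are all assumptions made for this sketch, not details from the papers cited above.

```python
import random

# Illustrative MDP (an assumption for this sketch): P[s][a] is the next
# state, R[s][a] the reward. Taking action 1 in state 1 yields reward 2
# forever, so the optimal reward rate is 2.
P = [[1, 0], [0, 1]]
R = [[1.0, 0.0], [0.0, 2.0]]

def rvi_q_step(Q, s, a, r, s2, alpha=0.1):
    # RVI Q-learning subtracts a reference function f(Q) in place of a
    # learned reward rate; here f(Q) = max_a Q[0][a], one common choice.
    f = max(Q[0])
    delta = r - f + max(Q[s2]) - Q[s][a]
    Q[s][a] += alpha * delta

def diff_q_step(Q, r_bar, s, a, r, s2, alpha=0.1, eta=1.0):
    # Differential Q-learning (Wan et al. 2021a) instead maintains an
    # explicit reward-rate estimate r_bar, updated from the same TD error.
    delta = r - r_bar + max(Q[s2]) - Q[s][a]
    Q[s][a] += alpha * delta
    return r_bar + eta * alpha * delta

rng = random.Random(0)
Q_rvi = [[0.0, 0.0], [0.0, 0.0]]
Q_diff = [[0.0, 0.0], [0.0, 0.0]]
r_bar, s = 0.0, 0
for _ in range(5000):
    a = rng.randrange(2)              # uniform-random behavior policy (off-policy)
    s2, r = P[s][a], R[s][a]
    rvi_q_step(Q_rvi, s, a, r, s2)
    r_bar = diff_q_step(Q_diff, r_bar, s, a, r, s2)
    s = s2

# Both estimates of the optimal reward rate should approach 2:
# f(Q) = max(Q_rvi[0]) for RVI Q-learning, and r_bar for Differential
# Q-learning.
print(round(max(Q_rvi[0]), 2), round(r_bar, 2))
```

Note the structural difference the sketch highlights: RVI Q-learning pins down the reward-rate estimate through the reference function f, while Differential Q-learning learns it as a separate scalar, which is what allows its weaker assumptions discussed next.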
The convergence of the RVI Q-learning algorithm (Abounadi et al. 2001) was established for unichain MDPs, meaning MDPs in which the Markov chain induced by every stationary policy is unichain.² The convergence of Differential Q-learning (Wan et al. 2021a) requires a weaker



¹ See Appendix D of Wan et al. (2021a) for a discussion of Yang's proof, and Appendix C of this paper for a discussion of Gosavi's proof.
² A Markov chain is unichain if it has exactly one recurrent class, plus a possibly empty set of transient states.

