ON CONVERGENCE OF AVERAGE-REWARD OFF-POLICY CONTROL ALGORITHMS IN WEAKLY-COMMUNICATING MDPS

Abstract

We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton 2021a) and RVI Q-learning (Abounadi, Bertsekas, & Borkar 2001), converge in weakly-communicating MDPs. Weakly-communicating MDPs are the most general class of MDPs in which a learning algorithm with a single stream of experience can guarantee obtaining a policy that achieves the optimal reward rate. The original convergence proofs of the two algorithms require that all optimal policies induce unichains, which is not necessarily true in weakly-communicating MDPs. To the best of our knowledge, our results are the first showing that average-reward off-policy control algorithms converge in weakly-communicating MDPs. As a direct extension, we show that the average-reward options algorithms introduced by Wan, Naik, & Sutton (2021b) converge if the Semi-MDP induced by the options is weakly-communicating.

1. INTRODUCTION

Modern reinforcement learning algorithms are designed to maximize the agent's goal in either the episodic setting or the continuing setting. In both settings, an agent continually interacts with its world, which is usually assumed to be a Markov Decision Process (MDP). In episodic problems, there is a special terminal state and a set of start states; if the agent reaches the terminal state, it is reset to one of the start states. Continuing problems are different in that there is no terminal state, and the agent is never reset by the world. For continuing problems, two commonly considered objectives are the discounted objective and the average-reward objective. The discount factor in the discounted objective has been argued to be problematic in the function-approximation control setting, suggesting that the average-reward objective might be more suitable for continuing problems. In this paper, we consider off-policy control algorithms for the average-reward objective. These algorithms learn a policy that achieves the best possible reward rate, using data generated by some other policy that the agent may not control.

Designing convergent off-policy algorithms for the average-reward objective is challenging. While there are several off-policy learning algorithms in the literature, the only known convergent algorithms are SSP Q-learning and RVI Q-learning, both by Abounadi, Bertsekas, & Borkar (2001), the algorithm by Ren & Krogh (2001), and Differential Q-learning by Wan, Naik, & Sutton (2021a). Others either do not have convergence proofs (Schwartz 1993; Singh 1994; Bertsekas & Tsitsiklis 1996; Das 1999), or have incorrect proofs (Yang 2016; Gosavi 2004).¹ The algorithm by Ren & Krogh (2001) requires knowledge of properties of the MDP that are not typically known. The convergence of SSP Q-learning is limited to MDPs in which some state is recurrent under all policies.
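To make the discussion concrete, the tabular Differential Q-learning update of Wan, Naik, & Sutton (2021a) can be sketched as below. The function name and the step-size values are our own illustrative choices; the key feature of the algorithm is that a learned reward-rate estimate, rather than a discount factor, offsets the temporal-difference error.

```python
import numpy as np

def differential_q_step(Q, r_bar, s, a, r, s_next, alpha=0.1, eta=1.0):
    """One tabular Differential Q-learning update (sketch).

    Q:     |S| x |A| array of action-value estimates
    r_bar: scalar estimate of the optimal reward rate
    The average-reward TD error replaces discounting with r_bar.
    """
    delta = r - r_bar + Q[s_next].max() - Q[s, a]  # average-reward TD error
    Q[s, a] += alpha * delta                       # action-value update
    r_bar += eta * alpha * delta                   # reward-rate update
    return Q, r_bar
```

Because the same TD error drives both the action-value update and the reward-rate update, no reference state is needed, which is what makes the off-policy analysis delicate.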
The convergence of the RVI Q-learning algorithm (Abounadi et al. 2001) was established for unichain MDPs, meaning that the Markov chain induced by any stationary policy is unichain². The convergence of Differential Q-learning (Wan et al. 2021a) requires a weaker assumption: all optimal policies are unichain. It is clear that RVI Q-learning also converges under this assumption with a small modification of its proof. It is not rare for an optimal policy to induce multiple recurrent classes; consider, for example, the MDP at the bottom of Figure 1. If an optimal policy induces multiple recurrent classes, the proofs of RVI Q-learning and Differential Q-learning do not go through. Technically, this is because both proofs require uniqueness, up to an additive constant, of the action-value solution of the average-reward optimality equation. A more general class of MDPs, called weakly-communicating MDPs, places no limit on the number of recurrent classes under an optimal policy. In these MDPs, all policies may induce multiple recurrent classes. The only assumption is that, except for a set of states that are transient under every policy, every state is reachable from every other state in a finite number of steps with non-zero probability. It has been observed that the set of weakly-communicating MDPs is the most general set of MDPs for which there exists a learning algorithm that can, using a single stream of experience, guarantee to identify a policy achieving the optimal reward rate (Bartlett & Tewari 2009). In this paper, we show convergence of RVI Q-learning and Differential Q-learning in weakly-communicating MDPs, without requiring any additional assumptions compared with their original convergence results.
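For contrast, a tabular RVI Q-learning update of Abounadi et al. (2001) can be sketched as follows. The choice of a single reference state-action pair for the offset f(Q) is one common instantiation, not the only one the original analysis permits, and the function name and step size here are illustrative.

```python
import numpy as np

def rvi_q_step(Q, s, a, r, s_next, ref=(0, 0), alpha=0.1):
    """One tabular RVI Q-learning update (sketch).

    Instead of a learned reward-rate estimate, the value of a fixed
    reference state-action pair, f(Q) = Q[ref], offsets the update.
    """
    f_Q = Q[ref]  # reference value standing in for the reward rate
    delta = r - f_Q + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    return Q
```

The reliance on the reference value f(Q) is one place where uniqueness of the optimality-equation solution, up to an additive constant, enters the original proof.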
Two key steps in our proof are 1) showing that the solution sets of the two algorithms are non-empty, closed, bounded, and connected, and 2) showing that 0 is the unique solution of the Bellman optimality equation when all rewards are 0. With these two results, we use the asynchronous stochastic approximation results of Borkar (2009) to show convergence to the solution sets. As a direct extension of the above results, we also show the convergence of two algorithms, introduced by Wan et al. (2021b), that extend Differential Q-learning to the options framework, if the Semi-MDP induced by the MDP and the set of options is weakly-communicating.

2. PRELIMINARIES

Consider a finite Markov decision process, defined by the tuple M ≐ (S, A, R, p), where S is a set of states, A is a set of actions, R is a set of rewards, and p : S × R × S × A → [0, 1] is the dynamics of the environment. At every time step t, the agent observes the state of the MDP S_t ∈ S and chooses an action A_t ∈ A using some policy b : A × S → [0, 1], then receives from the environment a reward R_{t+1} ∈ R and the next state S_{t+1} ∈ S, and so on. The transition dynamics are defined as p(s′, r | s, a) ≐ Pr(S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a) for all s, s′ ∈ S, a ∈ A, and r ∈ R. Denote the set of stationary Markov policies by Π. The reward rate of a policy π starting from a given start state s is defined as:

r(π, s) ≐ lim_{n→∞} (1/n) Σ_{t=1}^{n} E[R_t | S_0 = s, A_{0:t−1} ∼ π].

In an arbitrary MDP, the agent may not be able to visit all states, and would therefore miss the chance of learning, for every state s, a policy that achieves the optimal reward rate sup_π r(π, s); the agent can at most learn an optimal policy for a set of states in which every state is reachable from every other state. Such a set of states is often called communicating. Formally, a set of states is communicating if there exists a policy under which moving from any state in the set to any other state in the set in a finite number of steps has positive probability. If the entire state space is communicating, the MDP is communicating. Weakly-communicating MDPs are those whose state space can be partitioned into a communicating set of states and a (possibly empty) set of states that are transient under every policy.
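The reward rate r(π, s) defined above can be approximated by simply averaging rewards along a long trajectory of the Markov chain that a fixed policy induces. The sketch below does this for an illustrative two-state chain; the transition matrix P and reward matrix R are our own example inputs, not notation from the paper.

```python
import numpy as np

def estimate_reward_rate(P, R, s0, n_steps=100_000, seed=0):
    """Monte-Carlo estimate of r(pi, s0) = lim (1/n) sum_t E[R_t] for the
    Markov chain induced by a fixed policy.

    P[s, s_next] is the transition probability and R[s, s_next] the
    reward received on that transition.
    """
    rng = np.random.default_rng(seed)
    s, total = s0, 0.0
    for _ in range(n_steps):
        s_next = rng.choice(len(P), p=P[s])  # sample the next state
        total += R[s, s_next]
        s = s_next
    return total / n_steps

# Illustrative two-state chain that alternates deterministically,
# earning reward 1 on the transition from state 0 to state 1.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.array([[0.0, 1.0], [0.0, 0.0]])
```

For this chain the agent earns reward 1 on every other step, so the estimate converges to a reward rate of 0.5 from either start state.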



¹ See Appendix D in Wan et al. (2021a) for a discussion of Yang's proof, and see Appendix C of this paper for a discussion of Gosavi's proof.
² A Markov chain is unichain if it has exactly one recurrent class, plus a possibly empty set of transient states.



Figure 1: Examples of different types of MDPs. Each of the three MDPs has two states, marked by circles, and two actions, solid and dashed, both with deterministic effects. Top: The MDP is unichain under every stationary policy. Middle: There are two deterministic optimal policies, moving to state 2 and going back and forth between states 1 and 2. The other optimal stationary policies are mixtures of these two deterministic policies. All optimal policies are unichain. Bottom: There are three deterministic optimal policies: moving to state 1, moving to state 2, and staying at the current state. The other optimal stationary policies are mixtures of these three deterministic policies. The policy of staying at the current state induces two recurrent classes.

