ON CONVERGENCE OF AVERAGE-REWARD OFF-POLICY CONTROL ALGORITHMS IN WEAKLY-COMMUNICATING MDPS

Abstract

We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton, 2021a) and RVI Q-learning (Abounadi, Bertsekas, & Borkar, 2001), converge in weakly-communicating MDPs. Weakly-communicating MDPs are the most general class of MDPs for which a learning algorithm with a single stream of experience can guarantee obtaining a policy achieving the optimal reward rate. The original convergence proofs of the two algorithms require that all optimal policies induce unichains, which is not necessarily true for weakly-communicating MDPs. To the best of our knowledge, our results are the first showing that average-reward off-policy control algorithms converge in weakly-communicating MDPs. As a direct extension, we show that the average-reward options algorithms introduced by Wan, Naik, & Sutton (2021b) converge if the Semi-MDP induced by the options is weakly-communicating.

1. INTRODUCTION

Modern reinforcement learning algorithms are designed to maximize the agent's goal in either the episodic setting or the continuing setting. In both settings, an agent continually interacts with its world, which is usually assumed to be a Markov Decision Process (MDP). In episodic problems, there is a special terminal state and a set of start states; if the agent reaches the terminal state, it is reset to one of the start states. Continuing problems differ in that there is no terminal state, and the agent is never reset by the world. For continuing problems, two commonly considered objectives are the discounted objective and the average-reward objective. The discount factor in the discounted objective has been observed to be problematic in the function-approximation control setting, suggesting that the average-reward objective may be more suitable for continuing problems. In this paper, we consider off-policy control algorithms for the average-reward objective. These algorithms learn a policy that achieves the best possible reward rate, using data generated by some other policy that the agent may not have control of. Designing convergent off-policy algorithms for the average-reward objective is challenging. While there are several off-policy learning algorithms in the literature, the only known convergent algorithms are SSP Q-learning and RVI Q-learning, both by Abounadi, Bertsekas, & Borkar (2001), the algorithm by Ren & Krogh (2001), and Differential Q-learning by Wan, Naik, & Sutton (2021a). Others either do not have convergence proofs (Schwartz, 1993; Singh, 1994; Bertsekas & Tsitsiklis, 1996; Das, 1999), or have incorrect proofs (Yang, 2016; Gosavi, 2004).[1] The algorithm by Ren & Krogh (2001) requires knowledge of properties of the MDP which are not typically known. The convergence of SSP Q-learning is limited to MDPs in which some state is recurrent under all policies.
The convergence of the RVI Q-learning algorithm (Abounadi et al., 2001) was established for unichain MDPs, in which the Markov chain induced by every stationary policy is unichain.[2] The convergence of Differential Q-learning (Wan et al., 2021a) requires a weaker assumption: all optimal policies are unichain. It is clear that RVI Q-learning also converges under this assumption with a small modification of its proof. It is not rare that an optimal policy induces multiple recurrent classes; for example, consider the MDP at the bottom of Figure 1. If an optimal policy induces multiple recurrent classes, the proofs of RVI Q-learning and Differential Q-learning do not go through. Technically, this is because both proofs require uniqueness, up to an additive constant, of the solution for the action-value function in the average-reward optimality equation. A more general class of MDPs, called weakly-communicating MDPs, places no limit on the number of recurrent classes under optimal policies. In these MDPs, all policies may induce multiple recurrent classes. The only assumption is that, except for a set of states that are transient under every policy, every state is reachable from every other state in a finite number of steps with non-zero probability. It has been observed that the set of weakly-communicating MDPs is the most general set of MDPs for which there exists a learning algorithm that can, using a single stream of experience, guarantee to identify a policy achieving the optimal average reward rate (Bartlett & Tewari, 2009). In this paper, we show convergence of RVI Q-learning and Differential Q-learning in weakly-communicating MDPs, without requiring any additional assumptions compared with their original convergence results.
Two key steps in our proof are 1) showing that the solution sets of the two algorithms are non-empty, closed, bounded, and connected, and 2) showing that 0 is the unique solution of the Bellman optimality equation when all rewards are 0. With these two results, we use the asynchronous stochastic approximation results of Borkar (2009) to show convergence to the solution sets. As a direct extension of the above results, we also show the convergence of two algorithms, introduced by Wan et al. (2021b), that extend Differential Q-learning to the options framework, provided the Semi-MDP induced by the MDP and the set of options is weakly-communicating.

2. PRELIMINARIES

Consider a finite Markov decision process, defined by the tuple M ≐ (S, A, R, p), where S is a set of states, A is a set of actions, R is a set of rewards, and p : S × R × S × A → [0, 1] is the dynamics of the environment. At every time step t, the agent observes the state of the MDP S_t ∈ S and chooses an action A_t ∈ A using some policy b : A × S → [0, 1], then receives from the environment a reward R_{t+1} ∈ R and the next state S_{t+1} ∈ S, and so on. The transition dynamics are defined as p(s′, r | s, a) ≐ Pr(S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a) for all s, s′ ∈ S, a ∈ A, and r ∈ R. Denote the set of stationary Markov policies by Π. The reward rate of a policy π starting from a given start state s is defined as:

r(π, s) ≐ lim_{n→∞} (1/n) Σ_{t=1}^{n} E[R_t | S_0 = s, A_{0:t−1} ∼ π].   (1)

Given an arbitrary MDP, the agent may not even be able to visit all states, and would therefore miss the chance of learning, for every state s, a policy that achieves the optimal reward rate sup_π r(π, s); the agent can at most learn an optimal policy for a set of states in which every state is reachable from every other state. Such a set of states is often called communicating. Formally, we say a set of states is communicating if there exists a policy under which moving from any state in the set to any other state in the set in a finite number of steps has positive probability. If the entire state space is communicating, we say the MDP is communicating. Weakly-communicating MDPs generalize communicating MDPs: in addition to a closed communicating set of states, there is a possibly empty set of states that are transient under every policy. For weakly-communicating MDPs, there exists a unique optimal reward rate r*, which does not depend on the start state. We say a policy is optimal if it achieves r* regardless of the start state. The goal of an off-policy control algorithm is to learn an optimal policy from the stream of experience . . .
, S_t, A_t, R_{t+1}, S_{t+1}, . . . generated by a behavior policy that is not necessarily the same as the agent's learned policy. Both RVI Q-learning and Differential Q-learning achieve this goal by solving for r̄ and q in the optimality equation:

q(s, a) = Σ_{s′,r} p(s′, r | s, a)(r − r̄ + max_{a′} q(s′, a′)), ∀ s ∈ S, a ∈ A.   (2)

It is known that r* is the unique solution for r̄ and that any greedy policy w.r.t. any solution for q is an optimal policy. In addition, shifting any solution for q by any constant vector results in another solution for q. Finally, unlike in unichain MDPs (or MDPs where all optimal policies are unichain), where any two solutions differ by a constant vector, in weakly-communicating MDPs solutions for q may have multiple degrees of freedom. That is, if q_1, q_2 are both solutions for q, it is possible that q_1 ≠ q_2 + ce for all c ∈ R, where e denotes the all-one vector. If the agent has a set of options O, it may choose to execute these options. Each option o ∈ O has two components: the option's policy π_o : A × S → [0, 1] and the termination probability β_o : S → [0, 1]. For simplicity, for any s ∈ S, o ∈ O, we write π(a | s, o) for π_o(a, s) and β(s, o) for β_o(s). If the agent executes option o at state s, the option's policy is followed until the option terminates. Let L be the set of all possible lengths of options and R be the set of all possible cumulative rewards. Note that L and R are possibly countably infinite. Let p(s′, r, l | s, o) be, when executing option o starting from state s, the probability of terminating at state s′ with cumulative reward r and length l. Formally, for any s, s′ ∈ S, o ∈ O, r ∈ R, l ∈ L, p can be defined recursively as:

p(s′, r, l | s, o) = Σ_a π(a | s, o) Σ_{s̄,r̄} p(s̄, r̄ | s, a) [β(s̄, o) I(s̄ = s′, r̄ = r, l = 1) + (1 − β(s̄, o)) p(s′, r − r̄, l − 1 | s̄, o)],   (3)

where I is an indicator function. An MDP M and a set of options O result in a Semi-MDP (SMDP) M̂ = (S, O, L, R, p).
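The constant-shift property just noted can be checked numerically: the right-hand side of equation 2, viewed as an operator on q, maps q + ce to the operator value plus ce, so the fixed-point residual is invariant to constant shifts. A minimal sketch (the random MDP and the value of the reward-rate estimate are our own illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 4, 2
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # transition probs
R = rng.random((nS, nA))                                         # expected rewards
rbar = 0.3                                                       # any reward-rate estimate

def bellman(q):
    # T(q)(s,a) = sum_{s'} P[s,a,s'] * (R[s,a] - rbar + max_{a'} q[s',a'])
    return R - rbar + P @ q.max(axis=1)

q = rng.standard_normal((nS, nA))
c = 5.0
# Shifting q by a constant shifts T(q) by the same constant ...
assert np.allclose(bellman(q + c), bellman(q) + c)
# ... so the residual T(q) - q is unchanged: q solves eq. 2 iff q + c*e does.
assert np.allclose(bellman(q + c) - (q + c), bellman(q) - q)
```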
If the agent chooses options using a meta-policy (policy over options) π : O × S → [0, 1] and executes these options, we denote by . . . , Ŝ_n, Ô_n, R̂_{n+1}, Ŝ_{n+1}, . . . the sequence of option transitions. For this SMDP, the reward rate of π given a start state s can be defined either at the level of the underlying MDP,

r_C(π, s) ≐ lim_{t→∞} E_π[Σ_{i=1}^{t} R_i | S_0 = s]/t,

or at the level of option transitions,

r(π, s) ≐ lim_{n→∞} E_π[Σ_{i=0}^{n} R̂_i | Ŝ_0 = s] / E_π[Σ_{i=0}^{n} L̂_i | Ŝ_0 = s],

where L̂_i is the length of the i-th option transition. Both limits exist and are equivalent (Puterman, 1994, Proposition 11.4.1) under the following assumption.

Assumption 1. For each option o ∈ O, when executing the option, there is a non-zero probability of terminating the option after at most |S| stages, regardless of the initial state.

Proposition 1. Under Assumption 1, the expected value as well as the variance of the execution time and cumulative reward of every option at every state exist and are finite.

We say an SMDP is weakly-communicating if the MDP with state space S, action space O, reward space R, and transition function Σ_l p(s′, r, l | s, o) is weakly-communicating. Just as in the MDP setting, if the SMDP is weakly-communicating, the optimal reward rate r* ≐ sup_{π∈Π̂} r(π, s), where Π̂ denotes the set of stationary Markov meta-policies, does not depend on the start state s. In addition, solutions for q may not differ by a constant. Given an MDP and a set of options, the goal of the off-policy control problem is to find a policy that achieves r*. Inter-option Differential Q-learning achieves this goal by solving the optimality equation for SMDPs (Puterman, 1994):

q(s, o) = Σ_{s′,r,l} p(s′, r, l | s, o)(r − r̄ · l + max_{o′} q(s′, o′)), ∀ s ∈ S, o ∈ O,   (4)

where q and r̄ denote estimates of the option-value function and the reward rate, respectively. Just as in the MDP setting, r* is the unique solution for r̄, and solutions for q may not differ by a constant vector. Intra-option Differential Q-learning finds an optimal policy by solving the intra-option optimality equation:
q(s, o) = Σ_a π(a | s, o) Σ_{s′,r} p(s′, r | s, a)(r − r̄ + u_q(s′, o)), ∀ s ∈ S, o ∈ O,   (5)

where

u_q(s′, o) ≐ (1 − β(s′, o)) q(s′, o) + β(s′, o) max_{o′} q(s′, o′).   (6)

The following proposition shows that the set of solutions of equation 5 is the same as that of equation 4.

Proposition 2. Any solution of equation 4 is also a solution of equation 5 and vice versa.
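The relation between the two optimality equations is easy to sanity-check in the degenerate case where every option is a primitive action that terminates immediately (β ≡ 1): then u_q(s′, o) = max_{o′} q(s′, o′) and the right-hand side of equation 5 reduces to that of equation 2. A small numerical check (the MDP and the one-option-per-action construction are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 3, 2
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
rbar = 0.5

# One option per primitive action: option o always takes action o and
# terminates immediately (beta = 1 everywhere).
beta = np.ones((nS, nA))

def u(q, s, o):
    # u_q(s, o) = (1 - beta(s,o)) q(s,o) + beta(s,o) max_{o'} q(s,o')
    return (1 - beta[s, o]) * q[s, o] + beta[s, o] * q[s].max()

def intra_option_rhs(q):
    # Right-hand side of equation 5 for these degenerate options
    out = np.empty((nS, nA))
    for s in range(nS):
        for o in range(nA):
            out[s, o] = sum(P[s, o, s2] * (R[s, o] - rbar + u(q, s2, o))
                            for s2 in range(nS))
    return out

def mdp_rhs(q):
    # Right-hand side of equation 2
    return R - rbar + P @ q.max(axis=1)

q = rng.standard_normal((nS, nA))
assert np.allclose(intra_option_rhs(q), mdp_rhs(q))
```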

3. CONVERGENCE RESULTS

In this section, we present the convergence theory of Differential Q-learning and RVI Q-learning in weakly-communicating MDPs, and that of the two option extensions of Differential Q-learning in weakly-communicating SMDPs. Empirical results verifying the convergence of the two MDP algorithms are presented in Appendix B. Differential Q-learning updates a table of estimates Q_t : S × A → R as follows:

Q_{t+1}(S_t, A_t) ≐ Q_t(S_t, A_t) + α_{ν(t,S_t,A_t)} δ_t,   (7)
Q_{t+1}(s, a) ≐ Q_t(s, a), ∀ (s, a) ≠ (S_t, A_t),

where ν(t, S_t, A_t) is the number of times (S_t, A_t) has been visited before time step t, {α_n} is a step-size sequence, and δ_t, the temporal-difference (TD) error, is:

δ_t ≐ R_{t+1} − R̄_t + max_a Q_t(S_{t+1}, a) − Q_t(S_t, A_t),   (8)

where R̄_t is a scalar estimate of r*, updated by:

R̄_{t+1} ≐ R̄_t + η α_{ν(t,S_t,A_t)} δ_t,   (9)

and η is a positive constant. We now present the convergence theorem of Differential Q-learning. We first state the required assumptions, which are also required by the original convergence proof of Differential Q-learning by Wan et al. (2021a).

Assumption 2. For all n ≥ 0, α_n > 0, Σ_{n=0}^{∞} α_n = ∞, and Σ_{n=0}^{∞} α_n² < ∞.

Assumption 3. Let [·] denote the integer part of (·). For x ∈ (0, 1), sup_n α_{[xn]}/α_n < ∞ and (Σ_{n=0}^{[ym]} α_n)/(Σ_{n=0}^{m} α_n) → 1 uniformly in y ∈ [x, 1].

Assumption 4. There exists Δ > 0 such that lim inf_{n→∞} ν(n, s, a)/(n + 1) ≥ Δ, a.s., for all s ∈ S, a ∈ A. Furthermore, for all x > 0, with N(n, x) ≐ min{m > n : Σ_{k=n+1}^{m} α_k ≥ x}, the limit

lim_{n→∞} (Σ_{k=ν(n,s,a)}^{ν(N(n,x),s,a)} α_k)/(Σ_{k=ν(n,s′,a′)}^{ν(N(n,x),s′,a′)} α_k)

exists a.s. for all s, s′ ∈ S, a, a′ ∈ A.

Theorem 1. If M is communicating and Assumptions 1-4 hold, the Differential Q-learning algorithm (Equations 7-9) converges, almost surely: R̄_t to r*, Q_t to the set of solutions of equation 2 satisfying

r* − R̄_0 = η (Σ_{s,a} q(s, a) − Σ_{s,a} Q_0(s, a)),   (10)

and r(π_t, s) to r*, for all s ∈ S, where π_t is any greedy policy w.r.t. Q_t.
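To make the update rules concrete, here is a minimal tabular implementation of Equations 7-9 on a small two-state communicating MDP; the MDP, the uniform-random behavior policy, and the step-size schedule are our own illustrative choices, not from the paper. The optimal policy stays in state 1 for reward 2, so r* = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
# Deterministic 2-state MDP: next_state[s][a], reward[s][a].
# From state 0: a0 stays (reward 1), a1 moves to state 1 (reward 0).
# From state 1: a0 moves to state 0 (reward 0), a1 stays (reward 2).
next_state = [[0, 1], [0, 1]]
reward = [[1.0, 0.0], [0.0, 2.0]]
r_star = 2.0  # optimal reward rate: stay in state 1 forever

Q = np.zeros((2, 2))
Rbar = 0.0
eta = 1.0
visits = np.zeros((2, 2), dtype=int)

s = 0
for t in range(300_000):
    a = rng.integers(2)                       # uniform random behavior policy
    s2, r = next_state[s][a], reward[s][a]
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a] ** 0.7         # decaying step size (Assumption 2)
    delta = r - Rbar + Q[s2].max() - Q[s, a]  # TD error, equation 8
    Q[s, a] += alpha * delta                  # equation 7
    Rbar += eta * alpha * delta               # equation 9
    s = s2

print(abs(Rbar - r_star))  # should be small
```

After a few hundred thousand steps, R̄ should settle near r* = 2 and the greedy policy w.r.t. Q should stay in state 1, illustrating the off-policy nature of the algorithm: the behavior policy remains uniform random throughout.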
Remark: If the MDP is weakly-communicating, that is, if it contains transient states, the agent eventually reaches the closed communicating set of states and never returns to the transient states. Elements of Q_t that are associated with the closed communicating set converge to a set that depends on the values of Q and R̄ when the MDP reaches the closed communicating set for the first time. The other elements of Q_t are visited only a finite number of times, and no learning algorithm can guarantee their convergence to correct values. The other conclusions of the theorem remain unchanged. This observation on weakly-communicating MDPs applies also to Theorems 2-4.

The update rules of RVI Q-learning are

Q_{t+1}(S_t, A_t) ≐ Q_t(S_t, A_t) + α_{ν(t,S_t,A_t)} δ_t(S_t, A_t),   (11)
Q_{t+1}(s, a) ≐ Q_t(s, a), ∀ (s, a) ≠ (S_t, A_t),

where

δ_t(S_t, A_t) ≐ R_{t+1} − f(Q_t) + max_a Q_t(S_{t+1}, a) − Q_t(S_t, A_t),   (12)

and f : R^{S×A} → R satisfies the following assumption.

Assumption 5. 1) f is L-Lipschitz; 2) there exists a positive scalar u such that f(e) = u and f(x + ce) = f(x) + cu; 3) f(cx) = cf(x).

Theorem 2. If M is communicating and Assumptions 1-5 hold, then the RVI Q-learning algorithm (Equations 11-12) converges, almost surely: f(Q_t) to r*, Q_t to the set of solutions of equation 2 satisfying

r* = f(q),   (13)

and r(π_t, s) to r*, for all s ∈ S, where π_t is any greedy policy w.r.t. Q_t.

Now consider the option extensions of Differential Q-learning. Given an SMDP M̂ = (S, O, L, R, p), inter-option Differential Q-learning maintains estimates of option values and, inspired by Schweitzer (1971), updates the estimates using scaled TD errors:

Q_{n+1}(Ŝ_n, Ô_n) ≐ Q_n(Ŝ_n, Ô_n) + α_{ν(n,Ŝ_n,Ô_n)} δ_n / L_n(Ŝ_n, Ô_n),   (14)
Q_{n+1}(s, o) ≐ Q_n(s, o), ∀ (s, o) ≠ (Ŝ_n, Ô_n),

R̄_{n+1} ≐ R̄_n + η α_{ν(n,Ŝ_n,Ô_n)} δ_n / L_n(Ŝ_n, Ô_n),   (15)

where ν(n, Ŝ_n, Ô_n) is the number of visits to (Ŝ_n, Ô_n) before stage n, and L_n(·, ·) comes from an additional table of estimates L_n : S × O → R that approximates the expected lengths of state-option pairs, updated from experience by:

L_{n+1}(Ŝ_n, Ô_n) ≐ L_n(Ŝ_n, Ô_n) + β_{ν(n,Ŝ_n,Ô_n)} (L̂_n − L_n(Ŝ_n, Ô_n)),   (16)

where {β_n} is another step-size sequence and L̂_n is the observed length of the n-th option transition. The TD error δ_n in equations 14 and 15 is

δ_n ≐ R̂_{n+1} − L_n(Ŝ_n, Ô_n) R̄_n + max_o Q_n(Ŝ_{n+1}, o) − Q_n(Ŝ_n, Ô_n).   (17)

Theorem 3. If M̂ is communicating, Assumptions 1-4 hold (with ν(n, s, o) in place of ν(t, s, a)), and 0 ≤ β_n ≤ 1, Σ_n β_n = ∞, and Σ_n β_n² < ∞, then inter-option Differential Q-learning (Equations 14-17) converges, almost surely: R̄_n to r*, Q_n to the set of solutions of equation 4 satisfying

r* − R̄_0 = η (Σ_{s,o} q(s, o) − Σ_{s,o} Q_0(s, o)),   (18)

and r(π_n, s) to r*, for all s ∈ S, where π_n is any greedy policy w.r.t. Q_n.

Intra-option Differential Q-learning also maintains estimates of option values. However, instead of updating the estimates using option transitions, it updates all options using each action transition:

Q_{t+1}(S_t, o) ≐ Q_t(S_t, o) + α_{ν(t,S_t,o)} ρ_t(o) δ_t(o), ∀ o ∈ O,   (19)
Q_{t+1}(s, o) ≐ Q_t(s, o), ∀ s ≠ S_t, o ∈ O,

R̄_{t+1} ≐ R̄_t + η Σ_{o∈O} α_{ν(t,S_t,o)} ρ_t(o) δ_t(o),   (20)

where {α_n} is a step-size sequence, ρ_t(o) ≐ π(A_t | S_t, o)/π(A_t | S_t, O_t) is the importance sampling ratio, and:

δ_t(o) ≐ R_{t+1} − R̄_t + u_{Q_t}(S_{t+1}, o) − Q_t(S_t, o),   (21)

where u is defined in equation 6.

Theorem 4. If M̂ is communicating and Assumptions 1-4 hold (with ν(t, s, o) in place of ν(t, s, a)), intra-option Differential Q-learning (Equations 19-21) converges, almost surely: Q_t to the set of solutions of equation 4 satisfying equation 18, R̄_t to r*, and r(π_t, s) to r*, for all s ∈ S, where π_t is any greedy policy w.r.t. Q_t.
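For comparison with the Differential Q-learning sketch above, here is RVI Q-learning (Equations 11-12) on the same style of toy two-state MDP (again our own illustrative example), taking f(Q) to be the mean of all entries of Q, which satisfies Assumption 5 with u = 1. Here f(Q_t) plays the role of the reward-rate estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-state communicating MDP (illustrative, not from the paper):
# optimal behavior stays in state 1 for reward 2, so r* = 2.
next_state = [[0, 1], [0, 1]]
reward = [[1.0, 0.0], [0.0, 2.0]]
r_star = 2.0

f = lambda Q: Q.mean()  # satisfies Assumption 5 with u = 1

Q = np.zeros((2, 2))
visits = np.zeros((2, 2), dtype=int)
s = 0
for t in range(300_000):
    a = rng.integers(2)                       # uniform random behavior policy
    s2, r = next_state[s][a], reward[s][a]
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a] ** 0.7
    delta = r - f(Q) + Q[s2].max() - Q[s, a]  # equation 12
    Q[s, a] += alpha * delta                  # equation 11
    s = s2

print(abs(f(Q) - r_star))  # should be small
```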

4. CHARACTERIZATION OF THE SOLUTION SET

In this section, we characterize the sets that the algorithms described in the previous section converge to; this characterization plays a key role in showing their convergence. We consider the set of solutions of the SMDP optimality equation 4 that, in addition, satisfy

f(q) = c,   (22)

where f : R^{S×O} → R satisfies the conditions of Theorem 5 below and c is an arbitrary real; denote this set of solutions for q by Q_∞. It is known that if the SMDP is weakly-communicating, equation 4 has r* as its unique solution for r̄. For q, it has been shown by Schweitzer & Federgruen (1978) (we will refer to this work multiple times and thus use the shorthand 'S&F' from now on), in their Theorem 4.2, that the set of solutions for q in equation 4 alone is closed, unbounded, connected, and possibly non-convex. The next theorem characterizes Q_∞.

Theorem 5. If the SMDP is weakly-communicating and f 1) is Lipschitz and 2) satisfies, for some u ≠ 0, f(x + ce) = f(x) + cu for all x and all c ∈ R, then Q_∞ is non-empty, closed, bounded, connected, and possibly non-convex.

Before presenting the proof, note that our convergence proof does not rely on the convexity property, and we delay the proof of non-convexity to Appendix A.5.

Proof. First, Q_∞ is non-empty. To see this, note that for any solution q* for q in equation 4, q* + c′e is also a solution for any c′ ∈ R, and, because f(q* + c′e) = f(q*) + c′u with u ≠ 0, there must be a c′ such that equation 22 holds. Q_∞ is closed because the set of solutions for q in equation 4 is closed by S&F, the set of solutions of equation 22 is closed because f is Lipschitz (hence continuous), and the intersection of two closed sets is closed.
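The non-emptiness argument is constructive: given any solution q* and a target value c, the shift c′ = (c − f(q*))/u puts q* + c′e on the constraint surface of equation 22. A quick numerical check with f = max, which satisfies the conditions with u = 1 (the numbers below are arbitrary stand-ins, not an actual solution of equation 4):

```python
import numpy as np

u = 1.0
f = np.max           # Lipschitz, and f(x + c*e) = f(x) + c, so u = 1
c_target = 7.0

q_star = np.array([1.5, -2.0, 0.25, 4.0])   # stand-in for a solution of eq. 4
shift = (c_target - f(q_star)) / u          # the c' used in the proof
assert np.isclose(f(q_star + shift), c_target)
```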

Boundedness

We now show that Q_∞ is bounded. To this end, it is enough to show that the set of solutions for v in

v(s) = max_o Σ_{s′,r,l} p(s′, r, l | s, o)(r − l r* + v(s′)), ∀ s ∈ S,   (23)
c = f(r + P v),   (24)

is bounded, where c is the constant in equation 22 and

r(s, o) ≐ Σ_{s′,r,l} p(s′, r, l | s, o)(r − l r*),
P(s, o, s′) ≐ Σ_{r,l} p(s′, r, l | s, o).

Denote this solution set by V_∞. Once we show that V_∞ is bounded, Q_∞ is also bounded, because each solution for q is obtained from a solution for v by the linear operation

q(s, o) = Σ_{s′,r,l} p(s′, r, l | s, o)(r − l r* + v(s′)), ∀ s ∈ S, o ∈ O.   (25)

To show boundedness we need two lemmas, which slightly modify Theorems 4.1 and 5.1 of S&F. We first introduce some definitions. For any π ∈ Π̂, P_π denotes the |S| × |S| transition probability matrix under meta-policy π; that is, P_π(s, s′) ≐ Σ_{o,r,l} π(o | s) p(s′, r, l | s, o). Define P_π^∞ to be the limiting matrix of P_π, which is the Cesàro limit of the sequence {P_π^i}:

P_π^∞ ≐ lim_{n→∞} (1/n) Σ_{i=0}^{n−1} P_π^i.   (26)

Because S is finite, the Cesàro limit exists and P_π^∞ is a stochastic matrix (its rows sum to 1). Let l_π(s) ≐ Σ_{o,s′,r,l} π(o | s) p(s′, r, l | s, o) l, let r_π(s) ≐ Σ_o π(o | s) r(s, o), and let the fundamental matrix be

Z_π ≐ (I − P_π + P_π^∞)^{−1} = I + lim_{γ↑1} Σ_{n=1}^{∞} γ^n (P_π^n − P_π^∞).   (27)

Let b(v, π)(s) ≐ [r_π − l_π r* + P_π^∞ v − v](s) and let Π̂* denote the set of optimal meta-policies. The following two lemmas are similar to Theorem 4.1(c) and Theorem 5.1 of S&F, except that 1) the 'max' operates over the set of all optimal policies, instead of the set of all deterministic optimal policies as in Theorem 4.1(c), and 2) we consider weakly-communicating SMDPs while S&F consider general multichain SMDPs. The proofs are also essentially the same; for completeness, we provide them in Sections A.3-A.4.

Lemma 1. If the SMDP is weakly-communicating, v is a solution of equation 23 if and only if

v(s) = max_{π∈Π̂*} [Z_π (r_π − l_π r*) + P_π^∞ v](s), ∀ s ∈ S.   (28)

Define R_π as the set of recurrent states for P_π^∞ (i.e., R_π ≐ {s | P_π^∞(s, s) > 0}) and let R* ≐ {s | s ∈ R_π for some policy π ∈ Π̂*}. S&F show that there exists a policy π ∈ Π̂* such that R_π = R*; let n* be the number of recurrent classes of P_π^∞ for such a π, with recurrent classes {R*_α | α = 1, . . . , n*}. By Lemma 2 below, if v and v + x are both solutions of equation 23, then there exist n* constants y_1, . . . , y_{n*} such that

x(s) = y_α, s ∈ R*_α, α = 1, . . . , n*,   (29)
x(s) = max_{π∈Π̂*} { [Z_π b(v, π)](s) + Σ_{β=1}^{n*} (Σ_{s′∈R*_β} P_π^∞(s, s′)) y_β }, s ∈ S \ R*,
y_α ≥ [Z_π b(v, π)](s) + Σ_{β=1}^{n*} (Σ_{s′∈R*_β} P_π^∞(s, s′)) y_β, α = 1, . . . , n*, s ∈ R*_α, π ∈ Π̂*.   (30)

For any β ∈ {1, 2, . . . , n*}, note that there must exist a policy π(β) such that R*_β is the only recurrent class under π(β). To see this, note that the SMDP is weakly-communicating, so we can modify π to obtain a new policy under which all states outside R*_β are transient; such a policy satisfies the requirement. Applying this observation and Lemma 2, for any v ∈ V_∞ and any given β ∈ {1, . . . , n*},

y_α ≥ max_{s∈R*_α} [Z_{π(β)} b(v, π(β))](s) + Σ_{s′∈R*_β} P_{π(β)}^∞(s, s′) y_β = max_{s∈R*_α} [Z_{π(β)} b(v, π(β))](s) + y_β, ∀ α = 1, . . . , n*.   (31)

The first term, max_{s∈R*_α} [Z_{π(β)} b(v, π(β))](s), is a constant given v and π(β). Therefore, for any other solution v + x of equation 23, if y_β is arbitrarily large, then every y_α, α = 1, . . . , n*, must also be arbitrarily large. This would violate the Lipschitz assumption on f. To see this, let

f̃(v) ≐ f(r + P v).   (32)

Let L be the Lipschitz constant of f; L is also a Lipschitz constant of f̃ because P is a stochastic matrix and thus a non-expansion. Also note that f̃(v + de) = f̃(v) + κd with κ ≐ u ≠ 0, because Pe = e. Fix v_1 ∈ V_∞ and suppose, for contradiction, that V_∞ is not upper bounded; then, by the observation above, for arbitrarily large d > 0 there exists v_2 ∈ V_∞ such that ṽ_2 ≐ v_2 − de stays within a bounded distance of v_1; denote m ≐ ∥v_1 − ṽ_2∥ and choose d large enough that κd > Lm. Because f̃(v_1 + de) = f̃(v_1) + κd and f̃(v_1) = f̃(v_2) = c by equation 24,

|f̃(v_1 + de) − f̃(v_2)| = κd > Lm = L ∥v_1 + de − ṽ_2 − de∥ = L ∥v_1 + de − v_2∥,

which violates the Lipschitz property of f̃. Because the choice of β was arbitrary, V_∞ is upper bounded. In addition, because the choice of β is arbitrary, we have for any α ∈ {1, . . . , n*},

y_α ≥ max_{β∈{1,...,n*}} { max_{s∈R*_α} [Z_{π(β)} b(v, π(β))](s) + y_β }.

If y_α is arbitrarily small, then every y_β, β = 1, . . . , n*, must also be arbitrarily small, and this is again ruled out by equation 24 for the same reason. Therefore the y_α cannot be arbitrarily small, and V_∞ is lower bounded. Combining the upper and lower bounds, V_∞ is bounded, and therefore Q_∞ is also bounded.

Connectedness

We now show that Q_∞ is connected. Again, it is enough to show that V_∞ is connected. Let V denote the set of solutions of equation 23 alone; V is connected (S&F) and V_∞ ⊆ V. Define a function mapping any v ∈ V to a solution in V_∞: fix c and let g : V → V_∞ with g(v) ≐ v + xe, where x solves f̃(v + xe) = c and f̃ is defined in equation 32. Note that x is unique given v, because f̃(v + xe) = f̃(v) + κx, so x = (c − f̃(v))/κ. We now show that g is Lipschitz continuous. Consider any v_1, v_2 ∈ V and let x_1, x_2 be such that v_1 + x_1 e = g(v_1) and v_2 + x_2 e = g(v_2); again x_1, x_2 are unique given v_1, v_2. Note that

f̃(v_1 + x_1 e) − f̃(v_2 + x_1 e) = f̃(v_1 + x_1 e) − f̃(v_2 + x_2 e) + κ(x_2 − x_1) = c − c + κ(x_2 − x_1) = κ(x_2 − x_1),

so that

κ |x_1 − x_2| = |f̃(v_1 + x_1 e) − f̃(v_2 + x_1 e)| ≤ L ∥v_1 + x_1 e − v_2 − x_1 e∥ = L ∥v_1 − v_2∥,

and hence

∥g(v_1) − g(v_2)∥ = ∥v_1 + x_1 e − v_2 − x_2 e∥ ≤ ∥v_1 − v_2∥ + |x_1 − x_2| ∥e∥ ≤ (L∥e∥/κ + 1) ∥v_1 − v_2∥.

Therefore g is Lipschitz continuous. Finally, because V is connected and the image of a continuous function on a connected set is connected, g(V) is connected. Every point of g(V) belongs to V_∞ by definition, and every point of V_∞ is also a point of g(V): pick any x ∈ V_∞; then x ∈ V and g(x) = x. Therefore V_∞ = g(V) is connected. Given that V_∞ is connected, Q_∞ is also connected, because Q_∞ is the image of V_∞ under the linear (hence continuous) map in equation 25.

The other result needed to show convergence of the four algorithms introduced in the previous section is the following; with it, the stability of the algorithms can be established using the result of Borkar and Meyn (2000) (see also Section 3.2 of Borkar, 2009).

Lemma 3. If an SMDP is weakly-communicating and all rewards are 0, then 0 is the only element of Q_∞.

Proof.
Given a weakly-communicating SMDP with all rewards 0, by Lemma 1 any solution v of the state-value optimality equation 23 satisfies

v(s) ≥ Σ_{s′∈R*} P_π^∞(s, s′) v(s′), ∀ s ∈ R*, π ∈ Π̂*,

where R* is defined right after Lemma 1. Also, because all rewards are 0, every policy attains the optimal reward rate r* = 0, so Π̂* = Π̂ contains all stationary meta-policies, and R* is the closed communicating class. Denote by v* a solution of equation 23 and pick an arbitrary policy π ∈ Π̂. For every recurrent class C under P_π, we have v*(s) ≥ Σ_{s′∈C} d_π^C(s′) v*(s′) for all s ∈ C, where d_π^C denotes the stationary distribution of π on C. Summing both sides over s weighted by d_π^C(s),

Σ_s d_π^C(s) v*(s) ≥ Σ_s d_π^C(s) Σ_{s′∈C} d_π^C(s′) v*(s′) = Σ_{s′∈C} d_π^C(s′) v*(s′).

Because the left-hand side equals the right-hand side, the inequality must hold with equality for every s ∈ C, and therefore v*(s) = v*(s′) for all s, s′ ∈ C. Now for any s, s′ ∈ R*, because s and s′ are in the same communicating class, there exists a π ∈ Π̂* = Π̂ with a path from s to s′ and a path from s′ to s; hence s and s′ are in the same recurrent class under P_π, and we conclude that v*(s) = v*(s′) for all s, s′ ∈ R*. The values of transient states are uniquely determined by the values of states in R*, so the solution set for v in equation 23 is unique up to an additive constant.

The solution set of the option-value optimality equation 4 is then also unique up to an additive constant. To see this, let v(s) = max_o q(s, o); then equation 4 transforms into the state-value optimality equation, so any solution q satisfies that max_o q(·, o) is unique up to an additive constant. Moreover, for any solution q*, we have q*(s, o_1) = q*(s, o_2) for all s ∈ S, o_1, o_2 ∈ O, because for all s ∈ S, o ∈ O (using that, with zero rewards, max_{o′} q*(·, o′) is constant across states):

q*(s, o) = Σ_{s′,r,l} p(s′, r, l | s, o) max_{o′} q*(s′, o′) = Σ_{s′,r,l} p(s′, r, l | s, o) max_{o′} q*(s, o′) = max_{o′} q*(s, o′).

Together with equation 22, the solution for q is uniquely determined. Finally, 0 is that solution: one can verify directly that 0 solves equation 4 with r̄ = 0 and equation 22 with c = 0 (note that f(0) = 0 by Assumption 5). The lemma is proved.
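Lemma 3 is the fact behind the stability argument: with all rewards zero, the only admissible limit point is 0. This can be seen numerically with a damped synchronous iteration of the corresponding dynamics, Q ← Q + α(T(Q) − f(Q)e − Q), on a zero-reward two-state toy MDP of our own, with f = max (a sketch under these illustrative assumptions, not the asynchronous algorithm itself):

```python
import numpy as np

# Zero-reward 2-state deterministic MDP: action 0 stays, action 1 switches.
next_state = [[0, 1], [1, 0]]
f = np.max  # satisfies Assumption 5 with u = 1

Q = np.array([[1.0, 0.5], [0.2, -0.3]])  # arbitrary initial values
alpha = 0.1
for _ in range(5000):
    T = np.array([[Q[next_state[s][a]].max() for a in range(2)]
                  for s in range(2)])     # T(Q)(s,a) = 0 + max_a' Q(s',a')
    Q = Q + alpha * (T - f(Q) - Q)        # damped (Euler) step of the dynamics

print(np.abs(Q).max())  # should be near 0
```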

5. PROOF SKETCH OF THEOREM 1-THEOREM 4

In this section, we sketch the proofs of Theorems 1-4. It has been shown that all four algorithms introduced above are special cases of the General RVI Q algorithm (Wan et al., 2021a,b). Wan et al. also showed that General RVI Q converges under an assumption that is not satisfied in weakly-communicating MDPs/SMDPs. To show convergence for weakly-communicating MDPs/SMDPs, we replace this assumption with three weaker assumptions that these MDPs/SMDPs do satisfy. All other assumptions are the same as those used by Wan et al. (2021a,b) and can be verified for all four algorithms using their arguments. We present General RVI Q and prove its convergence under the three new assumptions in Appendix A.6. The next step of the proof is verifying the three new assumptions when casting General RVI Q to each of the four algorithms; this is straightforward given our Theorem 5 and Lemma 3, and we delay it to Appendix A.7. With the three assumptions verified, the conclusion of the convergence theorem of General RVI Q holds for each of the four algorithms. The convergence of the reward rates of greedy policies w.r.t. the action/option values follows from the convergence of these values and is shown in Appendix A.8.

6. CONCLUSIONS

In this paper, we provide, for the first time, convergence results for off-policy average-reward control algorithms in weakly-communicating MDPs, which are known to be the most general class of MDPs for which a learning algorithm can guarantee obtaining an optimal policy. Specifically, we show that two existing algorithms, RVI Q-learning and Differential Q-learning, converge in weakly-communicating MDPs. As an extension, we also show that two off-policy average-reward options learning algorithms converge if the SMDP induced by the options is weakly-communicating.



[1] See Appendix D in Wan et al. (2021a) for a discussion of Yang's proof, and see Appendix C of this paper for a discussion of Gosavi's proof.
[2] A Markov chain is unichain if it has only one recurrent class, plus a possibly empty set of transient states.



Figure 1: Examples of three different types of MDPs. In each of the three MDPs, there are two states, marked by two circles, and two actions, solid and dashed, both with deterministic effects. Top: the MDP is unichain under every stationary policy. Middle: there are two deterministic optimal policies, moving to state 2 or going back and forth between state 1 and state 2; other optimal stationary policies are mixtures of these two deterministic policies, and all optimal policies are unichain. Bottom: there are three deterministic optimal policies: moving to state 1, moving to state 2, and staying at the current state; other optimal stationary policies are mixtures of these deterministic policies. The policy of staying at the current state induces two recurrent classes.

where f : R^{S×O} → R satisfies Assumption 5 and c is an arbitrary real. Denote this set by Q_∞. It is clear that equation 4 generalizes equation 2 and that equation 22 generalizes equation 10, equation 13, and equation 18. Thus the characterization of Q_∞ applies to the sets that the action/option values of the aforementioned algorithms are claimed to converge to in Theorems 1-4.

S&F show that there exists a policy π ∈ Π̂* such that R_π = R*. Let n(π) be the number of recurrent classes of P_π^∞ and let n* ≐ n(π). Denote the set of recurrent classes of P_π as {R*_α | α = 1, 2, . . . , n*}. The following lemma shows that the solution set of equation 23 has n* degrees of freedom.

Lemma 2. If the SMDP is weakly-communicating and v and v + x are both solutions of equation 23, then there exist n* constants y_1, y_2, . . . , y_{n*} such that

Denote by v* a solution of equation 23. Now pick an arbitrary policy π ∈ Π̂. For every recurrent class C under P_π, we have, for all s ∈ C, v*(s) ≥ Σ_{s′∈C} d_π^C(s′) v*(s′), where d_π^C denotes the stationary distribution of π on the recurrent class C. Sum both sides over s weighted by d_π^C(s).

The transient-state values are uniquely determined by the values of states in R*. Thus the solution set for v in the state-value optimality equation 23 is also unique up to an additive constant.

