OFF-POLICY AVERAGE REWARD ACTOR-CRITIC WITH DETERMINISTIC POLICY SEARCH

Abstract

The average reward criterion is relatively less explored, as most existing works in the reinforcement learning literature consider the discounted reward criterion. A few recent works present on-policy average reward actor-critic algorithms, but off-policy average reward actor-critic methods remain relatively unexplored. In this paper, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. We provide a finite time analysis of the resulting three-timescale stochastic approximation scheme with a linear function approximator and obtain an ϵ-optimal stationary policy with a sample complexity of Ω(ϵ^{-2.5}). We compare the average reward performance of our proposed algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms on MuJoCo-based environments.

1. INTRODUCTION

The reinforcement learning (RL) paradigm has shown significant promise for finding solutions to decision making problems that rely on reward-based feedback from the environment. Here one is mostly concerned with the long-term reward acquired by the algorithm. In the case of infinite horizon problems, the discounted reward criterion has largely been studied because of its simplicity. Major recent developments in the context of RL in continuous state-action spaces have considered the discounted reward criterion (Schulman et al., 2015; 2017; Lillicrap et al., 2016; Haarnoja et al., 2018). However, there are very few works which focus on the average reward performance criterion in the continuous state-action setting (Zhang & Ross, 2021; Ma et al., 2021). The average reward criterion has started receiving attention in recent times, and there are papers that discuss the benefits of using this criterion over the discounted reward (Dewanto & Gallagher, 2021; Naik et al., 2019). One reason is that the average reward criterion only considers recurrent states, and it happens to be the most selective optimization criterion in recurrent Markov Decision Processes (MDPs) according to the n-discount optimality criterion; please refer to Mahadevan (1996) for more details on the n-discount optimality criterion. Further, optimization in the average reward setting does not depend on the initial state distribution. Moreover, the discrepancy between the objective function and the evaluation metric that exists in the discounted reward setting is resolved by opting for the average reward criterion. We encourage the reader to go through Dewanto & Gallagher (2021); Naik et al. (2019) for a better understanding of the benefits mentioned. There are very few algorithms in the literature that optimize the average reward, and all of them happen to be on-policy algorithms (Zhang & Ross, 2021; Ma et al., 2021).
It has been demonstrated several times that on-policy algorithms are less sample efficient than off-policy algorithms (Lillicrap et al., 2016; Haarnoja et al., 2018; Fujimoto et al., 2018) for the discounted reward criterion. In this paper, we investigate whether the same holds for the average reward criterion. We address the research gap in the development of off-policy average reward algorithms for continuous state and action spaces by proposing an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. The policy evaluation step in the average reward setting is equivalent to finding the solution of the Poisson equation (i.e., the Bellman equation for a given policy). The Poisson equation, because of its form, does not admit a unique solution but only solutions that are unique up to an additive constant. Further, the policy evaluation step here consists of estimating not just the differential Q-value function but also the average reward. Because two quantities must be estimated instead of one, the roles of the optimization algorithm and the target network become more important. We therefore implement the proposed ARO-DDPG algorithm using target networks and by carefully selecting the optimization algorithm. The broad contributions of our paper are the following:

• We provide both on-policy and off-policy deterministic policy gradient theorems for the average reward performance metric.
• We present our Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm.
• We perform a non-asymptotic convergence analysis and provide a finite time analysis of our three-timescale stochastic approximation based actor-critic algorithm using a linear function approximator.
• We empirically compare our algorithm with other state-of-the-art algorithms in the literature.
The rest of the paper is structured as follows: In Section 2, we present the preliminaries on the MDP framework, the basic setting as well as the policy gradient algorithm. Section 3 presents the deterministic policy gradient theorem and our algorithm. Section 4 then presents the main theoretical results related to the finite time analysis. Section 5 presents the experimental results. In Section 6, we discuss other related work and Section 7 presents the conclusions. The detailed proofs for the finite time analysis are available in the Appendix.

2. PRELIMINARIES

Consider a Markov Decision Process (MDP) M = {S, A, R, P, π}, where S ⊂ R^n is the (continuous) state space, A ⊂ R^m is the (continuous) action space, and R : S × A → R denotes the reward function, with R(s, a) being the reward obtained under state s and action a. Further, P(·|s, a) denotes the state transition function defined as P : S × A → µ(·), where µ : B(S) → [0, 1] is a probability measure and B(S) is the Borel sigma algebra on S. A deterministic policy π is defined as π : S → A. A stochastic policy π_r is defined as π_r : S → µ′(·), where µ′ : B(A) → [0, 1] and B(A) is the Borel sigma algebra on A.

Assumption 1. The Markov process obtained under any policy π is ergodic.

Assumption 1 is necessary to ensure the existence of the steady state distribution of the Markov process.

2.1. DISCOUNTED REWARD MDPS

In discounted reward MDPs, discounting is controlled by γ ∈ (0, 1). The following performance metric is optimized with respect to the policy:

η(π) = E_π[ Σ_{t=0}^∞ γ^t R(s_t, a_t) ] = ∫_S ρ_0(s) V^π(s) ds. (1)

Here, ρ_0 is the initial state distribution and V^π is the value function; V^π(s) denotes the long term reward acquired when starting in state s. It satisfies

V^π(s_t) = E_π[ R(s_t, a_t) + γ V^π(s_{t+1}) | s_t ]. (2)
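As a concrete illustration of (2), for a finite Markov chain under a fixed policy the Bellman equation becomes a linear system that can be solved directly. The following minimal numpy sketch is hedged: the transition matrix, rewards and initial distribution are invented for illustration and are not part of the paper's setting.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])          # illustrative transition matrix P^pi(s'|s)
R = np.array([0.0, 1.0])            # illustrative rewards R(s, pi(s))

# Solve V = R + gamma * P V, i.e. (I - gamma P) V = R, cf. (2).
V = np.linalg.solve(np.eye(2) - gamma * P, R)

rho0 = np.array([1.0, 0.0])         # illustrative initial state distribution
eta = rho0 @ V                      # eta(pi), cf. (1)
assert np.allclose(V, R + gamma * P @ V)
```

Since γ < 1, the matrix I − γP is always invertible, which is what makes the discounted criterion simpler to work with than the average reward criterion below.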

2.2. AVERAGE REWARD MDPS

The performance metric in the case of average reward MDPs is the long-run average reward ρ(π) defined as follows:

ρ(π) = lim_{T→∞} (1/T) E_π[ Σ_{t=0}^{T-1} R(s_t, a_t) ] = ∫_S d^π(s) R_π(s) ds, (3)

where R_π(s) := R(s, π(s)). The limit in the first equality in equation 3 exists because of Assumption 1. The quantity d^π(s) in the second equality in equation 3 corresponds to the steady state probability of the Markov process being in state s ∈ S, and it exists and is unique given π, again from Assumption 1. V^π_diff is the differential value function corresponding to the policy π and is defined in (4). Further, the differential Q-value (or action-value) function Q^π_diff is defined in (5):

V^π_diff(s_t) = E_π[ Σ_{k=t}^∞ (R(s_k, a_k) - ρ(π)) | s_t ]. (4)

Q^π_diff(s_t, a_t) = E_π[ Σ_{k=t}^∞ (R(s_k, a_k) - ρ(π)) | s_t, a_t ]. (5)

Lemma 1. There exists a unique constant k (= ρ(π)) which satisfies the following equation for the differential value function V^π_diff:

V^π_diff(s_t) = E_π[ R(s_t, a_t) - k + V^π_diff(s_{t+1}) | s_t ]. (6)

Proof. See the appendix for the proof.
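To make (3)-(6) concrete, the sketch below computes ρ(π) and one solution of the Poisson equation of Lemma 1 exactly for a small finite ergodic chain. The chain and rewards are invented for illustration; this is a hedged sketch, not the paper's method.

```python
import numpy as np

# Invented 3-state ergodic chain under a fixed policy pi.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])   # P[s, s'] = P^pi(s' | s)
R = np.array([1.0, 0.0, 2.0])     # R_pi(s) = R(s, pi(s))

# Steady-state distribution d^pi: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d /= d.sum()

rho = d @ R   # second equality in (3)

# Poisson equation (Lemma 1): (I - P) V = R - rho * 1.  Solutions are unique
# only up to an additive constant; pinning V[0] = 0 selects one of them.
A = np.vstack([np.eye(3) - P, [[1.0, 0.0, 0.0]]])
b = np.append(R - rho, 0.0)
V, *_ = np.linalg.lstsq(A, b, rcond=None)

# V satisfies V(s) = R_pi(s) - rho + sum_s' P^pi(s'|s) V(s').
assert np.allclose(V, R - rho + P @ V)
```

Shifting V by any constant leaves the Poisson equation satisfied, which is precisely why the algorithm must estimate ρ separately from the differential value function.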

2.3. POLICY GRADIENT THEOREM

Unlike in Q-learning, where we try to find the optimal Q-value function and then infer the policy from it, the policy gradient theorem (Sutton et al., 1999; Silver et al., 2014; Degris et al., 2012) allows us to directly optimize the performance metric via its gradient with respect to the policy parameters. Q-learning can be viewed as a value iteration scheme, while an algorithm based on the policy gradient theorem can be seen as mimicking policy iteration. Sutton et al. (1999) provided the policy gradient theorem for on-policy optimization of both the discounted reward and the average reward criteria, see (7) and (8), respectively:

∇_θ η(π) = ∫_S ω^π(s) ∫_A ∇_θ π_r(a|s, θ) Q^π(s, a) da ds. (7)

∇_θ ρ(π) = ∫_S d^π(s) ∫_A ∇_θ π_r(a|s, θ) Q^π_diff(s, a) da ds. (8)

In (7), ω^π denotes the long term discounted state visitation probability density, defined in (9), while d^π(s) = lim_{t→∞} P^π_t(s) is the steady state probability density on states. P^π denotes the transition probability kernel for the Markov chain induced by policy π, and P^π_t is the state distribution at instant t, given by (10):

ω^π(s) = (1 - γ) Σ_{t=0}^∞ γ^t P^π_t(s). (9)

P^π_t(s) = ∫_{S×···×S} ρ_0(s_0) Π_{k=0}^{t-1} P^π(s_{k+1}|s_k) ds_0 · · · ds_{t-1}. (10)

The policy gradient theorem in Sutton et al. (1999) is only valid for on-policy algorithms. Degris et al. (2012) proposed an approximate off-policy policy gradient theorem for stochastic policies, see (11), where d^µ stands for the steady state density function corresponding to the behaviour policy µ:

∇_θ η(π) ≈ ∫_S d^µ(s) ∫_A ∇_θ π_r(a|s, θ) Q^π(s, a) da ds. (11)

Silver et al. (2014) came up with the deterministic policy gradient theorem, see (12), which eventually led to the development of the very successful Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016) and Twin Delayed DDPG (TD3) (Fujimoto et al., 2018) algorithms:

∇_θ η(π) = ∫_S ω^π(s) ∇_a Q^π(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds. (12)

3. PROPOSED AVERAGE REWARD ALGORITHM

We now propose the deterministic policy gradient theorem for the average reward criterion. The policy gradient estimator has to be derived separately for the on-policy and off-policy settings. Obtaining the on-policy deterministic policy gradient estimator is straightforward, but dealing with off-policy gradient estimates involves an approximate gradient (Degris et al., 2012).

3.1. ON-POLICY POLICY GRADIENT THEOREM

We cannot directly use the second equality of (3) to derive the policy gradient theorem, because the derivative of the steady state density function cannot be taken directly. Therefore, one needs to use (6) to obtain the average reward deterministic policy gradient theorem.

Theorem 1. The gradient of ρ(π) with respect to the policy parameter θ is given as follows:

∇_θ ρ(π) = ∫_S d^π(s) ∇_a Q^π_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds. (13)

Proof. See the appendix for the proof.

3.2. COMPATIBLE FUNCTION APPROXIMATION

The result in this section is mostly inspired by Silver et al. (2014). Recall that Q^π_diff(s, a) is the 'true' differential Q-value of the state-action tuple (s, a) under the parameterized policy π. Now let Q^w_diff(s, a) denote the approximate differential Q-value of the (s, a)-tuple when function approximation with parameter w is used. Lemma 2 says that when the function approximator satisfies a compatibility condition (cf. (14)-(15)), the gradient expression in (13) is also satisfied by Q^w_diff in place of Q^π_diff.

Lemma 2. Assume that the differential Q-value function (5) satisfies the following:

1. ∇_w ∇_a Q^w_diff(s, a) = ∇_θ π(s, θ). (14)

2. The differential Q-value function parameter w = w*_ϵ optimizes the following error function:

ζ(θ, w) = (1/2) ∫_S d^π(s) ∥∇_a Q^π_diff(s, a)|_{a=π(s)} - ∇_a Q^w_diff(s, a)|_{a=π(s)}∥² ds. (15)

Then,

∫_S d^π(s) ∇_a Q^π_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds = ∫_S d^π(s) ∇_a Q^w_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds. (16)

Further, in the case where a linear function approximator is used, we obtain

∇_a Q^w_diff(s, a) = ∇_θ π(s, θ)^⊺ w. (17)

Proof. See the appendix for a proof.

An important implication of Lemma 2 is that the dimensions of the matrices on the left and right hand sides of (14) must be the same. Hence, the dimensions of the parameters θ (used in the parameterized policy) and w (used to approximate the differential Q-value function) are the same. Lemma 2 shows that the compatible function approximation result has the same form in the average reward setting as in the discounted reward setting.
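The following hedged numpy sketch checks conditions (14) and (17) numerically for an invented linear policy π(s, θ) = Θs and a critic of the compatible form Q^w_diff(s, a) = (a - π(s))^⊺ (∇_θ π(s, θ)^⊺ w); the additive state-dependent baseline is omitted since it does not affect ∇_a. All names and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2                        # state and action dimensions (illustrative)
Theta = rng.normal(size=(m, n))    # policy parameters, pi(s) = Theta @ s

def pi(s):
    return Theta @ s

def grad_theta_pi(s):
    # Jacobian of pi(s, Theta) w.r.t. the flattened Theta, shape (m*n, m):
    # d pi_i / d Theta_{jk} = delta_{ij} * s_k.
    return np.kron(np.eye(m), s).T

w = rng.normal(size=m * n)         # critic parameter, same dimension as theta
s = rng.normal(size=n)

def Q_w(s, a):
    # Compatible critic (baseline omitted): linear in (a - pi(s)).
    return (a - pi(s)) @ (grad_theta_pi(s).T @ w)

# Finite-difference check of (17): grad_a Q_w(s, a) = grad_theta_pi(s)^T w.
a0, eps = pi(s), 1e-6
num_grad = np.array([
    (Q_w(s, a0 + eps * e) - Q_w(s, a0 - eps * e)) / (2 * eps)
    for e in np.eye(m)])
assert np.allclose(num_grad, grad_theta_pi(s).T @ w, atol=1e-6)
```

Since w has the same dimension as the flattened Θ, the sketch also illustrates the dimension-matching implication of Lemma 2.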

3.3. OFF-POLICY POLICY GRADIENT THEOREM

In order to derive the off-policy policy gradient theorem, it is not possible to follow the approach adopted by Degris et al. (2012) for the off-policy stochastic policy gradient theorem in the discounted reward setting. We first state our proposed approximate off-policy deterministic policy gradient theorem and then explain why some alternatives would not have worked.

Assumption 2. For the Markov chain obtained from the policy π, let K(·|·) be the transition kernel and S_π the steady state measure. Then there exist a > 0 and κ ∈ (0, 1) such that

D_TV(K^t(·|s), S_π(·)) ≤ a κ^t, ∀t, ∀s ∈ S.

Assumption 2 states that the Markov chain generated by a policy π satisfies the uniform ergodicity property. This assumption is necessary to obtain an upper bound on the total variation distance between the steady state probability distributions of two policies. It is used in Lemma 12, which in turn is used for Theorem 2.

Theorem 2. The approximate gradient of the average reward ρ(π) with respect to the policy parameter θ is given by the following expression:

∇̂_θ ρ(π) = ∫_S d^µ(s) ∇_a Q^π_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds. (18)

Further, the approximation error is E(π, µ) = ∥∇_θ ρ(π) - ∇̂_θ ρ(π)∥, where µ represents the behaviour policy. E satisfies

E(π, µ) ≤ Z ∥θ_π - θ_µ∥. (19)

Here, Z = 2^{m+1} C (⌈log_κ a^{-1}⌉ + 1/κ) L_t, with L_t being the Lipschitz constant of the transition probability density function (Assumption 9). The constants a and κ are from Assumption 2, m is the dimension of the action space, and C = max_s ∥∇_a Q^π_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ)∥.

Proof. See the appendix for a proof.

Theorem 2 suggests that the approximation error in the gradient increases as the difference between the target policy π and the behaviour policy µ increases.

3.4. OFF-POLICY ALTERNATIVES

In this section, we discuss alternatives that could be considered in place of the approach suggested in Section 3.3 and explain why those alternatives would not work.

1. One can possibly take inspiration from Degris et al. (2012) and define an objective function ρ_new(π), as in (20), which is a naive off-policy version of (3):

ρ_new(π) = ∫_S d^µ(s) R_π(s) ds. (20)

If, however, we take the derivative of ρ_new(π) defined above, we get the policy update rule in (21):

∇_θ ρ_new(π) = ∫_S d^µ(s) ∇_a R(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds. (21)

The update rule (21) only considers the reward function and not the transition dynamics of the MDP. In (18), by contrast, the derivative of the objective function includes the differential Q-value function, which encapsulates the information of both the reward function and the transition dynamics of the MDP, and hence is a valid derivative.

2. A lot of work in the off-policy setting relies on importance sampling ratios. Recently, a few works devised methods to estimate the ratio of steady state probability densities of the target and behaviour policies (Zhang et al., 2020a;b; Liu et al., 2018; Nachum et al., 2019). The ratio of steady state densities could be used for deterministic policy optimization as in (22), but certain issues prohibit its usage:

∇_θ ρ(π) = ∫_S d^µ(s) τ(s) ∇_a Q^π_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds. (22)

Here, τ(s) is the steady state probability density ratio d^π(s)/d^µ(s). In order to calculate τ(s), we need information about π(a|s), µ(a|s) and P(s′|s, a). We need the ratio π(a|s)/µ(a|s), and for deterministic policies this ratio would be δ(a - π(s))/δ(a - µ(s)), where δ(·) is the Dirac delta function:

δ(a - π(s))/δ(a - µ(s)) = { 0 if a = µ(s); ∞ if a = π(s); 0/0 otherwise }. (23)

From (23), it is clear that the ratio δ(a - π(s))/δ(a - µ(s)) is undefined for almost all actions a ∈ A. Thus, we cannot use this ratio for deterministic policies. Otherwise, we would need P(s′|s, π(s)) and P(s′|s, µ(s)). It is possible to get information about P(s′|s, µ(s)) by sampling from the Markov process generated by the policy µ, but obtaining this information for P(s′|s, π(s)) is impossible, since in the off-policy setting data from π is assumed to be simply unavailable.

3.5. ACTOR-CRITIC UPDATE RULE

Assumption 3. α_t, β_t, and γ_t are the step sizes for the critic, target estimator, and actor parameter updates respectively:

α_t = C_α / (1 + t)^σ,  β_t = C_β / (1 + t)^u,  γ_t = C_γ / (1 + t)^v.

Here, C_α, C_β, C_γ > 0 and 0 < σ < u < v < 1. Thus α_t is on the fastest timescale, β_t on a slower timescale, and γ_t on the slowest timescale.

The critic and average reward parameters are estimated using the TD(0) update rule, but with target estimators. We use target estimators to ensure stability of the iterates of the algorithm. Let {s_i, a_i, s′_i}_{i=0}^{n-1} denote the batch of data sampled from the replay buffer, and let w̄, ρ̄ and θ̄ denote the target counterparts of the critic, average reward and actor parameters, respectively:

ξ^j_t = (1/2) Σ_{i=0}^{n-1} ( R(s_i, a_i) - ρ_t - Q^{w_j}_diff(s_i, a_i) + min(Q^{w̄_1}_diff, Q^{w̄_2}_diff)(s′_i, π(s′_i, θ̄_t)) )², j ∈ {1, 2}. (24)

ξ³_t = (1/2) Σ_{i=0}^{n-1} ( R(s_i, a_i) - ρ_t - min(Q^{w_1}_diff, Q^{w_2}_diff)(s_i, a_i) + min(Q^{w̄_1}_diff, Q^{w̄_2}_diff)(s′_i, π(s′_i, θ̄_t)) )². (25)

Equations 24 and 25 are the Bellman errors for the differential Q-value function approximator and the average reward estimator respectively. Note that we use a double Q-value function approximator.

w^i_{t+1} = w^i_t - α_t ∇_{w_i} ξ^i_t, i ∈ {1, 2}. (26)

ρ_{t+1} = ρ_t - α_t ∇_ρ ξ³_t. (27)

The Bellman error in 24 is used to update the Q-value function approximator parameters w^i_t via 26, and the Bellman error in 25 is used to update the average reward estimator ρ_t via 27. The actor update is performed using Theorem 2: the actor parameter θ_t is updated using the empirical estimate (28) of the gradient in 18:

ν_i = ∇_a min(Q^{w_1}_diff, Q^{w_2}_diff)(s_i, a)|_{a=π(s_i)} ∇_θ π(s_i, θ_t). (28)

θ_{t+1} = θ_t + γ_t Σ_{i=0}^{n-1} ν_i. (29)

w̄^i_{t+1} = w̄^i_t + β_t (w^i_{t+1} - w̄^i_t), i ∈ {1, 2}. (30)

ρ̄_{t+1} = ρ̄_t + β_t (ρ_{t+1} - ρ̄_t). (31)

θ̄_{t+1} = θ̄_t + β_t (θ_{t+1} - θ̄_t). (32)

Equations 30-32 are used to update the target Q-value function approximator parameters w̄^i_t, the target average reward estimator ρ̄_t and the target actor parameter θ̄_t.
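The updates above can be sketched in numpy for a single batch. To keep the sketch self-contained we use linear critics, a linear policy, an invented feature map, synthetic batch data and a finite-difference ∇_a; every name and dimension here is illustrative, and the paper's actual implementation uses neural networks.

```python
import numpy as np

rng = np.random.default_rng(1)
d_s, d_a, d_phi, n = 3, 1, 8, 32
alpha, beta, lr_actor = 1e-2, 5e-3, 1e-3       # fast, intermediate, slow timescales

W_feat = rng.normal(size=(d_phi, d_s + d_a))   # invented feature map weights
def phi(s, a):
    return np.tanh(W_feat @ np.concatenate([s, a]))

def q(w, s, a):                                 # linear differential Q-value
    return phi(s, a) @ w

theta = 0.1 * rng.normal(size=(d_a, d_s))       # deterministic policy pi(s) = theta @ s
w1, w2 = rng.normal(size=d_phi), rng.normal(size=d_phi)
rho = 0.0
theta_b, w1_b, w2_b, rho_b = theta.copy(), w1.copy(), w2.copy(), rho  # target copies

# Synthetic replay-buffer batch (illustrative, not from a real environment).
S, A = rng.normal(size=(n, d_s)), rng.normal(size=(n, d_a))
S2, R = rng.normal(size=(n, d_s)), rng.normal(size=n)

grad_theta = np.zeros_like(theta)
for i in range(n):
    a2 = theta_b @ S2[i]
    min_next = min(q(w1_b, S2[i], a2), q(w2_b, S2[i], a2))
    # Critic updates: semi-gradient step on the squared Bellman error.
    for w in (w1, w2):
        delta = R[i] - rho + min_next - q(w, S[i], A[i])
        w += alpha * delta * phi(S[i], A[i])
    # Average-reward estimator update.
    rho += alpha * (R[i] - rho + min_next - min(q(w1, S[i], A[i]), q(w2, S[i], A[i])))
    # Actor gradient: finite-difference grad_a of the min critic at a = pi(s).
    a0, eps = theta @ S[i], 1e-5
    g_a = np.array([(min(q(w1, S[i], a0 + eps * e), q(w2, S[i], a0 + eps * e))
                     - min(q(w1, S[i], a0 - eps * e), q(w2, S[i], a0 - eps * e))) / (2 * eps)
                    for e in np.eye(d_a)])
    grad_theta += np.outer(g_a, S[i])           # grad_a Q * grad_theta pi for a linear pi
theta += lr_actor * grad_theta

# Target updates at the intermediate timescale.
w1_b += beta * (w1 - w1_b)
w2_b += beta * (w2 - w2_b)
rho_b += beta * (rho - rho_b)
theta_b += beta * (theta - theta_b)
```

The three learning rates mirror the three timescales of Assumption 3: the critics and ρ move fastest, the target copies at an intermediate rate, and the actor slowest.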

4. FINITE TIME ANALYSIS

In this section, we present the finite time analysis of the on-policy and off-policy average reward actor-critic algorithms with linear function approximators. We first state the assumptions made to perform the finite time analysis, followed by the main results.

Assumption 4. ϕ_π(s) = ϕ(s, π(s)) denotes the feature vector of state s and satisfies ∥ϕ_π(s)∥ ≤ 1.

The assumption above is taken simply for convenience.

Assumption 5. The reward function is uniformly bounded, viz., |R_π(s)| ≤ C_r < ∞.

Assumption 5 is required to ensure that the average reward objective function is bounded from above.

Assumption 6. Q^w_diff(s, a) is Lipschitz continuous w.r.t. a. Thus, for all w, ∥Q^w_diff(s, a_1) - Q^w_diff(s, a_2)∥ ≤ L_a ∥a_1 - a_2∥.

4.1. ON-POLICY ANALYSIS

In this section, we present the theorem for the finite time analysis of the on-policy version of the algorithm with a linear function approximator and target estimators for the critic and average reward.

Theorem 3. The on-policy average reward actor-critic algorithm (Algorithm 2) obtains an ϵ-accurate optimal point with a sample complexity of Ω(ϵ^{-2.5}). We obtain

min_{0≤t≤T-1} E∥∇_θ ρ(θ_t)∥² = O(1/T^{0.4}) + O(1) ≤ ϵ + O(1).

Proof. See the appendix for a proof.

We want to reach as close as possible to a value of θ such that ∥∇_θ ρ(θ)∥ = 0, which indicates that we have found a local maximum. The O(1) term is present in the bound because of the use of linear function approximation and does not shrink as time increases. However, if the O(1) term is small enough, the bound in Theorem 3 shows that as T increases, the algorithm gets close to a local maximum of the objective function (3). A similar O(1) term is present in (Xiong et al., 2022), where it is claimed that the term will be small if a neural network is used for the critic.

4.2. OFF-POLICY ANALYSIS

Under review as a conference paper at ICLR 2023

In this section, we present the theorem for the finite time analysis of the off-policy version of the algorithm with a linear function approximator and target estimators for the critic and average reward.

Theorem 4. The off-policy average reward actor-critic algorithm (Algorithm 3) with behaviour policy µ obtains an ϵ-accurate optimal point with a sample complexity of Ω(ϵ^{-2.5}). Here, θ_µ refers to the behaviour policy parameter and θ_t refers to the target or current policy parameter. We obtain

min_{0≤t≤T-1} E∥∇̂_θ ρ(θ_t)∥² = O(1/T^{0.4}) + O(1) + O(W²_θ) ≤ ϵ + O(1) + O(W²_θ),

where W_θ := max_t ∥θ_µ - θ_t∥.

Proof. See the appendix for a proof.

The significance of bounding ∥∇̂_θ ρ(θ_t)∥ is the same as explained above for Theorem 3. The error bound in the off-policy algorithm has an extra term O(W²_θ), which denotes the error induced by not using samples from the current policy to perform the updates. W²_θ will be small when a replay buffer is used, because the replay buffer contains data from policies similar to the current policy. This explains why Theorem 2 can be used with a replay buffer.

5. EXPERIMENTAL RESULTS

We conducted experiments on six different environments using the DeepMind control suite (Tassa et al., 2018) and found the performance of ARO-DDPG to be superior to that of the other algorithms. All the environments selected are infinite horizon tasks, with a maximum reward of 1 per time step, and none of the tasks have a goal-reaching nature. We performed all the experiments using 10 different seeds. We show here performance comparisons with two state-of-the-art algorithms: Average Reward TRPO (ATRPO) (Zhang & Ross, 2021) and Average Policy Optimization (APO) (Ma et al., 2021). In general, not many algorithms for the average reward criterion are available in the literature. We implemented the ATRPO algorithm using the instructions available in the original paper and used the original hyper-parameters suggested by its authors. For our proposed algorithm, we trained the agent for 1 million time steps and evaluated the agent every 5,000 time steps in the concerned environment. The length of each episode was taken to be 1,000 for the training phase and 10,000 for the evaluation phase. The reason for the longer episode length in the evaluation phase was to compare the long term average reward performance of the algorithms. We also tried using an episode length of 10,000 for the training phase and found that it gives poor average reward performance. If the agent lands in a state from which it cannot escape on its own before completing 10,000 steps, we do not reset it, but continue to give a penalty for the remaining length of the episode; that way, the cost of failure is very high. While training, we updated the actor after performing a fixed number of environment steps, and we updated the critic network more frequently than the actor network. We used target actor and critic networks, along with a target estimator of the average reward parameter, for stability while using bootstrapping updates.
We updated the target networks using Polyak averaging. We enforce multiple timescales in our algorithm by using different update frequencies for the actor, the critic, and the Polyak averaging of the target networks. We also borrowed the double Q-network trick from Fujimoto et al. (2018). Complete information regarding the set of hyper-parameters used is provided in the appendix.
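The Polyak averaging mentioned above is a soft target update; a minimal hedged sketch follows, where the function name and the value of τ are illustrative rather than taken from the paper's implementation.

```python
import numpy as np

def polyak_update(target, online, tau=0.005):
    """In-place soft update: target <- (1 - tau) * target + tau * online."""
    for t, o in zip(target, online):
        t *= (1.0 - tau)
        t += tau * o

target = [np.zeros(3)]
online = [np.ones(3)]
polyak_update(target, online, tau=0.5)
# target[0] is now [0.5, 0.5, 0.5]
```

A small τ makes the target parameters trail the online parameters slowly, which is one way of realizing the intermediate timescale of the target estimators.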

6. RELATED WORK

Actor-critic algorithms for the average reward performance criterion are much less studied compared to the discounted reward criterion. One of the earliest works on the average reward criterion is Mahadevan (1996). In that paper, Mahadevan compares the performance of R-learning, the average reward version of Q-learning, with that of Q-learning, and concludes that fine tuning is required to get better results from R-learning. Later, in 1999, Sutton et al. derived the policy gradient theorem for both the discounted and average reward criteria (Sutton et al., 1999), which formed the bedrock for the development of average reward actor-critic algorithms. The first proof of asymptotic convergence of average reward actor-critic algorithms with function approximation appeared in Konda & Tsitsiklis (2003). In 2007, Bhatnagar et al. proposed incremental natural policy gradient algorithms for the average reward setting and provided asymptotic convergence proofs for these. Recently, Wan et al. presented a Differential Q-learning algorithm and claimed that their algorithm is able to find the exact differential value function without an offset. Further, Wan et al. provided an extension of the options framework from the discounted setting to the average reward setting and demonstrated the performance of the algorithm on the Four-Room domain task. One of the major contributions in off-policy policy evaluation is made by Zhang et al. (2021a), who give a convergent off-policy evaluation scheme inspired by the gradient temporal difference learning algorithms but involving a primal-dual formulation, making the policy evaluation step feasible for a neural network implementation. Zhang et al. (2021b) provided another convergent off-policy evaluation algorithm using a target network and l2-regularisation; in our work we use the same policy evaluation update. Our work in this paper is in fact an extension of the work of Silver et al. (2014) from the discounted to the average reward setting. In Xiong et al. (2022), a finite time analysis of the deterministic policy gradient algorithm was carried out for the discounted reward setting. We perform the finite time analysis for the average reward deterministic policy gradient algorithm and, in particular, obtain the same sample complexity for our algorithm as reported by Wu et al. (2020) for stochastic policies.

7. CONCLUSION AND FUTURE WORK

In this paper, we presented a deterministic policy gradient theorem for both the on-policy and off-policy settings. We then proposed the Average Reward Off-policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm, which uses neural networks and a replay buffer, for high dimensional MuJoCo based environments. We observed superior performance of ARO-DDPG over existing average reward algorithms (ATRPO and APO). Finally, we provided a finite time analysis for the on-policy and off-policy algorithms obtained from the proposed policy gradient theorems and obtained a sample complexity of Ω(ϵ^{-2.5}). To extend the current line of work, one could try using a natural gradient based update rule for the deterministic policy. Further, in the current work we optimized the average reward performance (gain optimality). In the literature, optimizing the differential value function for all states is mentioned as part of achieving Blackwell optimality. Hence, actor-critic algorithms could be designed that optimize not only the average reward performance but also the differential value function (bias optimality).

A APPENDIX A.1 ADDITIONAL ASSUMPTIONS, PROOFS OF LEMMAS AND THEOREMS

We make the following additional assumptions.

Assumption 9. The transition probability density function for a policy π with parameter θ is Lipschitz continuous w.r.t. θ. Thus, max_{s′,s} |P^{π_1}(s′|s) - P^{π_2}(s′|s)| ≤ L_t ∥θ_1 - θ_2∥.

The above is a standard assumption in theoretical studies in the literature; references for such assumptions can be found in Xiong et al. (2022); Bertsekas (1975); Chow & Tsitsiklis (1991) and Dufour & Prieto-Rumeau (2015).

Assumption 10. The reward function for a policy π with parameter θ is Lipschitz continuous w.r.t. θ. Thus, max_s |R_{π_1}(s) - R_{π_2}(s)| ≤ L_r ∥θ_1 - θ_2∥.

The above assumption can be satisfied by using a well defined reward function, ensuring Lipschitz continuity of the reward function w.r.t. the action, and then invoking Assumption 7.

Assumption 11. The initial values of the target estimators are bounded. Thus, ∥w̄_0∥ ≤ C_w and ∥ρ̄_0∥ ≤ C_r + 2C_w.

Assumption 11 is used to enforce the stability of the iterates of the target estimators.

Assumption 12. Let A(θ) = ∫ d^π(s) ( ϕ_π(s) ( ∫ P^π(s′|s) ϕ_π(s′) ds′ - ϕ_π(s) )^⊺ - ηI ) ds. λ_min is the lower bound on the minimum eigenvalue of A(θ) over all values of θ.

The assumption above is used in Lemma 6 to prove the Lipschitz continuity, with respect to θ, of the optimal critic parameter w* corresponding to a given value of the policy parameter θ.

Assumption 13. Let A′(θ) = ∫ d^π(s) ( ϕ_π(s) ( ∫ P^π(s′|s) ϕ_π(s′) ds′ - ϕ_π(s) )^⊺ ) ds. λ^all_max is the upper bound on the maximum eigenvalue of (A′(θ) + A′(θ)^⊺)/2 over all values of θ.

Assumption 13 is used to prove the negative definiteness of the matrix A(θ) (defined in Assumption 12) in Lemma 11.

Assumption 14. Let H_θ = ∫_S d^π(s) ∇_θ π(s, θ) ∇_θ π(s, θ)^⊺ ds. λ^ϵ_min > 0 is the lower bound on the minimum eigenvalue of H_θ over all values of θ.

The above assumption is used in Lemma 13 to ensure that H_θ is invertible, so that the optimal critic parameter w*_ϵ in the sense of the compatible function approximation lemma (Lemma 2) can be obtained.
A similar assumption is present in (Xiong et al., 2022).

Assumption 15. Let A^µ_off′(θ) = ∫ d^µ(s) ( ϕ_π(s) ( ∫ P^π(s′|s) ϕ_π(s′) ds′ - ϕ_π(s) )^⊺ ) ds. χ^all_max is the upper bound on the maximum eigenvalue of (A^µ_off′(θ) + A^µ_off′(θ)^⊺)/2 for the behaviour policy µ and all values of θ.

Assumption 15 is used to prove the negative definiteness of the matrix A(θ) (defined in Lemma 15) in Lemma 16.

Lemma 1. There exists a unique constant k (= ρ(π)) which satisfies the following equation for the differential value function V^π_diff:

V^π_diff(s_t) = E_π[R(s_t, a_t) - k + V^π_diff(s_{t+1}) | s_t].

Proof. We have

V^π_diff(s_t) = R(s_t, π(s_t)) - k + ∫_S P^π(s_{t+1}|s_t) V^π_diff(s_{t+1}) ds_{t+1}

⟹ V^π_diff(s_t) - ∫_S P^π(s_{t+1}|s_t) V^π_diff(s_{t+1}) ds_{t+1} = R(s_t, π(s_t)) - k

⟹ Σ_{t=0}^{T-1} [ V^π_diff(s_t) - ∫_S P^π(s_{t+1}|s_t) V^π_diff(s_{t+1}) ds_{t+1} ] = Σ_{t=0}^{T-1} R(s_t, π(s_t)) - kT.

Integrating w.r.t. the stationary distribution d^π of policy π:

Σ_{t=0}^{T-1} ∫_S d^π(s_t) [ V^π_diff(s_t) - ∫_S P^π(s_{t+1}|s_t) V^π_diff(s_{t+1}) ds_{t+1} ] ds_t = Σ_{t=0}^{T-1} ∫_S d^π(s_t) R(s_t, π(s_t)) ds_t - kT

⟹ Σ_{t=0}^{T-1} [ ∫_S d^π(s_t) V^π_diff(s_t) ds_t - ∫_S d^π(s_{t+1}) V^π_diff(s_{t+1}) ds_{t+1} ] = Σ_{t=0}^{T-1} ∫_S d^π(s_t) R(s_t, π(s_t)) ds_t - kT.

Note that ∫_S d^π(s_t) V^π_diff(s_t) ds_t - ∫_S d^π(s_{t+1}) V^π_diff(s_{t+1}) ds_{t+1} = 0, since d^π is stationary. Therefore,

k = (1/T) Σ_{t=0}^{T-1} ∫_S d^π(s_t) R(s_t, π(s_t)) ds_t = lim_{T→∞} (1/T) Σ_{t=0}^{T-1} ∫_S d^π(s_t) R(s_t, π(s_t)) ds_t = ρ(π) (using (3)).

Theorem 1. The gradient of ρ(π) with respect to the policy parameter θ is given as follows:

∇_θ ρ(π) = ∫_S d^π(s) ∇_a Q^π_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds.

Proof.
Using Lemma 1:

V^π_diff(s_t) = R(s_t, π(s_t)) - ρ(π) + ∫_S P^π(s_{t+1}|s_t) V^π_diff(s_{t+1}) ds_{t+1}

⟹ Q^π_diff(s_t, π(s_t)) = R(s_t, π(s_t)) - ρ(π) + ∫_S P^π(s_{t+1}|s_t) Q^π_diff(s_{t+1}, π(s_{t+1})) ds_{t+1}.

Differentiating w.r.t. θ, we obtain

∇_θ Q^π_diff(s_t, π(s_t)) = ∇_θ R(s_t, π(s_t)) - ∇_θ ρ(π) + ∇_θ ∫_S P^π(s_{t+1}|s_t) Q^π_diff(s_{t+1}, π(s_{t+1})) ds_{t+1}

= ∇_a R(s_t, a)|_{a=π(s_t)} ∇_θ π(s_t) - ∇_θ ρ(π) + ∫_S ∇_a P(s_{t+1}|s_t, a)|_{a=π(s_t)} ∇_θ π(s_t) Q^π_diff(s_{t+1}, π(s_{t+1})) ds_{t+1} + ∫_S P^π(s_{t+1}|s_t) ∇_θ Q^π_diff(s_{t+1}, π(s_{t+1})) ds_{t+1}.

Note that ∇_a ρ(π) = ∇_a ∫_S d^π(s) R_π(s) ds = 0. Hence,

∇_θ Q^π_diff(s_t, π(s_t)) = ∇_a Q^π_diff(s_t, a)|_{a=π(s_t)} ∇_θ π(s_t) - ∇_θ ρ(π) + ∫_S P^π(s_{t+1}|s_t) ∇_θ Q^π_diff(s_{t+1}, π(s_{t+1})) ds_{t+1}.

Integrating w.r.t. the stationary distribution d^π(·) of policy π:

∫_S d^π(s_t) ∇_θ Q^π_diff(s_t, π(s_t)) ds_t = ∫_S d^π(s_t) ∇_a Q^π_diff(s_t, a)|_{a=π(s_t)} ∇_θ π(s_t) ds_t - ∇_θ ρ(π) + ∫_S d^π(s_t) ∫_S P^π(s_{t+1}|s_t) ∇_θ Q^π_diff(s_{t+1}, π(s_{t+1})) ds_{t+1} ds_t.

Note that ∫_S d^π(s) P^π(s′|s) ds = d^π(s′). Thus,

∇_θ ρ(π) = ∫_S d^π(s_t) ∇_a Q^π_diff(s_t, a)|_{a=π(s_t)} ∇_θ π(s_t) ds_t + ∫_S d^π(s_{t+1}) ∇_θ Q^π_diff(s_{t+1}, π(s_{t+1})) ds_{t+1} - ∫_S d^π(s_t) ∇_θ Q^π_diff(s_t, π(s_t)) ds_t

= ∫_S d^π(s) ∇_a Q^π_diff(s, a)|_{a=π(s)} ∇_θ π(s) ds.

Lemma 2. Assume that the differential Q-value function (5) satisfies the following:

1. ∇_w ∇_a Q^w_diff(s, a) = ∇_θ π(s, θ).

2. The differential Q-value function parameter w = w*_ϵ optimizes the following error function:

ζ(θ, w) = (1/2) ∫_S d^π(s) ∥∇_a Q^π_diff(s, a)|_{a=π(s)} - ∇_a Q^w_diff(s, a)|_{a=π(s)}∥² ds.

Then,

∫_S d^π(s) ∇_a Q^π_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds = ∫_S d^π(s) ∇_a Q^w_diff(s, a)|_{a=π(s)} ∇_θ π(s, θ) ds.

Further, ∇_a Q^w_diff(s, a) = ∇_θ π(s, θ)^⊺ w (for a linear function approximator).

Proof. Let E(θ, w, s) = ∇_a Q^π_diff(s, a)|_{a=π(s)} - ∇_a Q^w_diff(s, a)|_{a=π(s)}, so that

ζ(θ, w) = (1/2) ∫_S d^π(s) E(θ, w, s)^⊺ E(θ, w, s) ds.
Differentiating w.r.t. the critic parameter $w$, we obtain:
$$\nabla_{w}\zeta(\theta, w) = \int_{S} d^{\pi}(s)\,\nabla_{w} E(\theta, w, s)\, E(\theta, w, s)\, ds = -\int_{S} d^{\pi}(s)\,\nabla_{w}\nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)}\Big(\nabla_{a} Q^{\pi}_{diff}(s, a)\big|_{a=\pi(s)} - \nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)}\Big)\, ds = 0.$$
Letting $\nabla_{w}\nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)} = \nabla_{\theta}\pi(s)$, we obtain
$$\int_{S} d^{\pi}(s)\,\nabla_{a} Q^{\pi}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s, \theta)\, ds = \int_{S} d^{\pi}(s)\,\nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s, \theta)\, ds.$$
Let us consider the case of a linear function approximator with parameter $w$, i.e., $Q^{w}_{diff}(s, \pi(s)) = \phi^{\pi}(s, \pi(s))^{\top} w$. We know from above that
$$\nabla_{w}\nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)} = \nabla_{\theta}\pi(s) \implies \nabla_{a}\phi^{\pi}(s, a)\big|_{a=\pi(s)} = \nabla_{\theta}\pi(s). \quad (A.1)$$
Thus,
$$Q^{w}_{diff}(s, \pi(s)) = \phi^{\pi}(s, \pi(s))^{\top} w \implies \nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)} = \nabla_{a}\phi^{\pi}(s, a)\big|^{\top}_{a=\pi(s)} w \implies \nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)} = \nabla_{\theta}\pi(s)^{\top} w \quad (\text{using (A.1)}).$$

Theorem 2. The approximate gradient $\widehat{\nabla}_{\theta}\rho(\pi)$ of the average reward $\rho(\pi)$ with respect to the policy parameter $\theta$ is given by the following expression:
$$\widehat{\nabla}_{\theta}\rho(\pi) = \int_{S} d^{\mu}(s)\,\nabla_{a} Q^{\pi}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s, \theta)\, ds.$$
Further, the approximation error is $E(\pi, \mu) = \|\nabla_{\theta}\rho(\pi) - \widehat{\nabla}_{\theta}\rho(\pi)\|$, where $\mu$ represents the behaviour policy. $E$ satisfies $E(\pi, \mu) \le Z\|\theta_{\pi} - \theta_{\mu}\|$. Here, $Z = 2^{m+1} C(\lceil\log_{\kappa} a^{-1}\rceil + 1/\kappa)L_t$ with $L_t$ being the Lipschitz constant of the transition probability density function (Assumption 9). The constants $a$ and $\kappa$ are from Assumption 2, $m$ is the dimension of the state space (as in Lemma 12), and $C = \max_{s}\|\nabla_{a} Q^{\pi}_{diff}(s, a)|_{a=\pi(s)}\nabla_{\theta}\pi(s, \theta)\|$.

Proof.
$$E(\pi, \mu) = \|\nabla_{\theta}\rho(\pi) - \widehat{\nabla}_{\theta}\rho(\pi)\| = \Big\|\int_{S} d^{\pi}(s)\,\nabla_{a} Q^{\pi}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s, \theta)\, ds - \int_{S} d^{\mu}(s)\,\nabla_{a} Q^{\pi}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s, \theta)\, ds\Big\|$$
$$\le \int_{S}|d^{\pi}(s) - d^{\mu}(s)|\,\big\|\nabla_{a} Q^{\pi}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s, \theta)\big\|\, ds \le C\int_{S}|d^{\pi}(s) - d^{\mu}(s)|\, ds,$$
where $C = \max_{s}\|\nabla_{a} Q^{\pi}_{diff}(s, a)|_{a=\pi(s)}\nabla_{\theta}\pi(s, \theta)\|$. Thus, $E(\pi, \mu) \le C L_d\|\theta_{\pi} - \theta_{\mu}\| = Z\|\theta_{\pi} - \theta_{\mu}\|$ (using Lemma 12), with $Z = 2^{m+1} C(\lceil\log_{\kappa} a^{-1}\rceil + 1/\kappa)L_t$.

Lemma 3. Let the cumulative error of the on-policy actor be $\sum_{t=0}^{T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2}$ and the cumulative error of the critic be $\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}$, where $\theta_t$ and $w_t$ are the actor and linear critic parameters at time $t$. The cumulative error of the on-policy actor is bounded using the cumulative error of the critic as follows:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} \le \frac{2 C_r}{C_{\gamma}} T^{v-1} + 3C^{4}_{\pi}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}\Big) + 3C^{4}_{\pi}\Big(\tau^{2} + \frac{4}{M} C^{2}_{w^{*}_{\epsilon}}\Big) + \frac{C_{\gamma} L_J G^{2}_{\theta}}{1-v}\, T^{-v}.$$
Here, $C_r$ is the upper bound on rewards (Assumption 5); $C_{\gamma}, v$ are the constants used for the step size $\gamma_t$ (Assumption 3); $\|\nabla_{\theta}\pi(s)\| \le C_{\pi}$ (Assumption 7); $\Delta w_t = w_t - w^{*}_t$; $\tau = \max_{t}\|w^{*}_t - w^{*}_{\epsilon,t}\|$; $w^{*}_{\epsilon}$ is the optimal critic parameter according to Lemma 2; $w^{*}_t$ is the optimal parameter given by the TD(0) algorithm corresponding to the policy parameter $\theta_t$. The constant $C_{w^{*}_{\epsilon}}$ is defined in Lemma 13. $L_J$ is the coefficient used in the smoothness condition of the non-convex function $\rho(\theta)$. The constant $G_{\theta}$ is defined in Lemma 7. $M$ is the size of the batch of samples used to update the parameters.

Proof. By the $L_J$-smoothness of the non-convex function $\rho$, we have:
$$\mathbb{E}[\rho(\theta_{t+1})] \ge \mathbb{E}[\rho(\theta_t)] + \mathbb{E}\langle\nabla_{\theta}\rho(\theta_t), \theta_{t+1} - \theta_t\rangle - \frac{L_J}{2}\mathbb{E}\|\theta_{t+1} - \theta_t\|^{2}. \quad (A.2)$$
Now, with $h(B_t, w_t, \theta_t) = \frac{1}{M}\sum_{i}\nabla_{a} Q^{w_t}_{diff}(s_{t,i}, a)\big|_{a=\pi(s_{t,i})}\nabla_{\theta}\pi(s_{t,i})$:
$$\mathbb{E}\langle\nabla_{\theta}\rho(\theta_t), \theta_{t+1} - \theta_t\rangle = \gamma_t\,\mathbb{E}\langle\nabla_{\theta}\rho(\theta_t), h(B_t, w_t, \theta_t)\rangle = \gamma_t\,\mathbb{E}\langle\nabla_{\theta}\rho(\theta_t), h(B_t, w_t, \theta_t) - \nabla_{\theta}\rho(\theta_t)\rangle + \gamma_t\,\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2}. \quad (A.3)$$
For the first term in (A.3), we have
$$\mathbb{E}\langle\nabla_{\theta}\rho(\theta_t), h(B_t, w_t, \theta_t) - \nabla_{\theta}\rho(\theta_t)\rangle \ge -\frac{1}{2}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} - \frac{1}{2}\mathbb{E}\|h(B_t, w_t, \theta_t) - \nabla_{\theta}\rho(\theta_t)\|^{2} \quad (\because x^{\top}y \ge -\|x\|^{2}/2 - \|y\|^{2}/2).$$
(A.4)
From (A.4), for the deviation term:
$$\mathbb{E}\|h(B_t, w_t, \theta_t) - \nabla_{\theta}\rho(\theta_t)\|^{2} = \mathbb{E}\|h(B_t, w_t, \theta_t) - h(B_t, w^{*}_t, \theta_t) + h(B_t, w^{*}_t, \theta_t) - h(B_t, w^{*}_{\epsilon,t}, \theta_t) + h(B_t, w^{*}_{\epsilon,t}, \theta_t) - \nabla_{\theta}\rho(\theta_t)\|^{2}$$
$$\le 3\Big(\underbrace{\mathbb{E}\|h(B_t, w_t, \theta_t) - h(B_t, w^{*}_t, \theta_t)\|^{2}}_{①} + \underbrace{\mathbb{E}\|h(B_t, w^{*}_t, \theta_t) - h(B_t, w^{*}_{\epsilon,t}, \theta_t)\|^{2}}_{②} + \underbrace{\mathbb{E}\|h(B_t, w^{*}_{\epsilon,t}, \theta_t) - \nabla_{\theta}\rho(\theta_t)\|^{2}}_{③}\Big). \quad (A.5)$$
From (A.5):

①: By the compatible function approximation (Lemma 2), $\nabla_{a} Q^{w}_{diff}(s_i, a)\big|_{a=\pi(s_i)} = \nabla_{\theta}\pi(s_i)^{\top} w$, so
$$\mathbb{E}\|h(B_t, w_t, \theta_t) - h(B_t, w^{*}_t, \theta_t)\|^{2} = \mathbb{E}\Big\|\frac{1}{M}\sum_{i=0}^{M-1}\nabla_{\theta}\pi(s_{t,i})\,\nabla_{\theta}\pi(s_{t,i})^{\top}(w_t - w^{*}_t)\Big\|^{2} \le C^{4}_{\pi}\,\mathbb{E}\|w_t - w^{*}_t\|^{2}.$$

②: similar to ①:
$$\mathbb{E}\|h(B_t, w^{*}_t, \theta_t) - h(B_t, w^{*}_{\epsilon,t}, \theta_t)\|^{2} \le C^{4}_{\pi}\,\mathbb{E}\|w^{*}_t - w^{*}_{\epsilon,t}\|^{2} \le C^{4}_{\pi}\,\tau^{2}.$$

③:
• By the compatible function approximation (Lemma 2), $\nabla_{\theta}\rho(\theta_t) = \int_{S} d(s, \pi(\theta_t))\,\nabla_{\theta}\pi(s)\,\nabla_{\theta}\pi(s)^{\top} w^{*}_{\epsilon,t}\, ds = \mathbb{E}[h(B_t, w^{*}_{\epsilon,t}, \theta_t)]$.
• By Lemma 4 of Xiong et al. (2022), if $\mathbb{E}[\hat{Y}] = \bar{Y}$ and $\|\hat{Y}\|, \|\bar{Y}\| \le C_Y$, then $\mathbb{E}\big\|\frac{1}{M}\sum_{i=0}^{M-1}\hat{Y}_i - \bar{Y}\big\|^{2} \le \frac{4 C^{2}_{Y}}{M}$.

Using the above two points:
$$\mathbb{E}\|h(B_t, w^{*}_{\epsilon,t}, \theta_t) - \nabla_{\theta}\rho(\theta_t)\|^{2} \le \frac{4}{M}\,\big\|\nabla_{\theta}\pi(s)\,\nabla_{\theta}\pi(s)^{\top} w^{*}_{\epsilon,t}\big\|^{2} \le \frac{4 C^{4}_{\pi} C^{2}_{w^{*}_{\epsilon}}}{M}.$$
Combining ①, ② and ③ in (A.5):
$$\mathbb{E}\|h(B_t, w_t, \theta_t) - \nabla_{\theta}\rho(\theta_t)\|^{2} \le 3 C^{4}_{\pi}\Big(\mathbb{E}\|w_t - w^{*}_t\|^{2} + \tau^{2} + \frac{4 C^{2}_{w^{*}_{\epsilon}}}{M}\Big). \quad (A.6)$$
Using (A.6) in (A.4):
$$\mathbb{E}\langle\nabla_{\theta}\rho(\theta_t), h(B_t, w_t, \theta_t) - \nabla_{\theta}\rho(\theta_t)\rangle \ge -\frac{1}{2}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} - \frac{3}{2} C^{4}_{\pi}\Big(\mathbb{E}\|w_t - w^{*}_t\|^{2} + \tau^{2} + \frac{4 C^{2}_{w^{*}_{\epsilon}}}{M}\Big). \quad (A.7)$$
Using (A.7) in (A.3):
$$\mathbb{E}\langle\nabla_{\theta}\rho(\theta_t), \theta_{t+1} - \theta_t\rangle \ge \frac{\gamma_t}{2}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} - \frac{3\gamma_t}{2} C^{4}_{\pi}\Big(\mathbb{E}\|w_t - w^{*}_t\|^{2} + \tau^{2} + \frac{4 C^{2}_{w^{*}_{\epsilon}}}{M}\Big). \quad (A.8)$$
Using (A.8) in (A.2):
$$\mathbb{E}[\rho(\theta_{t+1})] - \mathbb{E}[\rho(\theta_t)] \ge \frac{\gamma_t}{2}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} - \frac{L_J}{2}\mathbb{E}\|\theta_{t+1} - \theta_t\|^{2} - \frac{3\gamma_t}{2} C^{4}_{\pi}\Big(\mathbb{E}\|w_t - w^{*}_t\|^{2} + \tau^{2} + \frac{4 C^{2}_{w^{*}_{\epsilon}}}{M}\Big)$$
$$\implies \mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} \le \frac{2}{\gamma_t}\big(\mathbb{E}[\rho(\theta_{t+1})] - \mathbb{E}[\rho(\theta_t)]\big) + 3 C^{4}_{\pi}\,\mathbb{E}\|w_t - w^{*}_t\|^{2} + 3 C^{4}_{\pi}\Big(\tau^{2} + \frac{4 C^{2}_{w^{*}_{\epsilon}}}{M}\Big) + L_J\gamma_t G^{2}_{\theta} \quad (\text{using } \mathbb{E}\|\theta_{t+1} - \theta_t\|^{2} \le \gamma^{2}_{t} G^{2}_{\theta} \text{ by Lemma 7})$$
$$\implies \sum_{t=0}^{T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} \le \underbrace{\sum_{t=0}^{T-1}\frac{2}{\gamma_t}\big(\mathbb{E}[\rho(\theta_{t+1})] - \mathbb{E}[\rho(\theta_t)]\big)}_{①} + \underbrace{\sum_{t=0}^{T-1} 3 C^{4}_{\pi}\,\mathbb{E}\|w_t - w^{*}_t\|^{2}}_{②} + \underbrace{\sum_{t=0}^{T-1} 3 C^{4}_{\pi}\Big(\tau^{2} + \frac{4 C^{2}_{w^{*}_{\epsilon}}}{M}\Big)}_{③} + \underbrace{\sum_{t=0}^{T-1} L_J\gamma_t G^{2}_{\theta}}_{④}. \quad (A.9)$$
From (A.9):

①: Summing by parts and using $|\mathbb{E}[\rho(\theta_t)]| \le C_r$ together with the monotonicity of $\gamma_t$:
$$\sum_{t=0}^{T-1}\frac{2}{\gamma_t}\big(\mathbb{E}[\rho(\theta_{t+1})] - \mathbb{E}[\rho(\theta_t)]\big) \le 2\Big[\sum_{t=1}^{T-1}\Big(\frac{1}{\gamma_t} - \frac{1}{\gamma_{t-1}}\Big)C_r + \frac{C_r}{\gamma_0}\Big] = \frac{2 C_r}{\gamma_{T-1}} = \frac{2 C_r\, T^{v}}{C_{\gamma}}.$$

②: $\sum_{t=0}^{T-1} 3 C^{4}_{\pi}\,\mathbb{E}\|w_t - w^{*}_t\|^{2} = \sum_{t=0}^{T-1} 3 C^{4}_{\pi}\,\mathbb{E}\|\Delta w_t\|^{2}$.

④: $\sum_{t=0}^{T-1} L_J\gamma_t G^{2}_{\theta} \le L_J G^{2}_{\theta} C_{\gamma}\,\frac{T^{1-v}}{1-v}$, since $\sum_{t=0}^{T-1}\frac{1}{(1+t)^{v}} \le \int_{0}^{T}\frac{dt}{t^{v}} = \frac{T^{1-v}}{1-v}$.

Using ①-④ and dividing (A.9) by $T$:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} \le \frac{2 C_r}{C_{\gamma}} T^{v-1} + 3 C^{4}_{\pi}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}\Big) + 3 C^{4}_{\pi}\Big(\tau^{2} + \frac{4}{M} C^{2}_{w^{*}_{\epsilon}}\Big) + \frac{C_{\gamma} L_J G^{2}_{\theta}}{1-v}\, T^{-v}.$$

Lemma 4. Let the cumulative error of the linear critic be $\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}$ and the cumulative error of the average reward estimator be $\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}$, where $w_t$ and $\rho_t$ are the linear critic parameter and the average reward estimator at time $t$, respectively. The cumulative error of the critic is bounded using the cumulative error of the average reward estimator as follows:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2} \le 2\Bigg(\sqrt{\frac{2C^{2}_{w}}{\lambda C_{\alpha}} T^{\sigma-1} + \frac{C_g C_{\alpha}}{1-\sigma} T^{-\sigma}} + \frac{L_w G_{\theta}}{\lambda}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + \frac{2(C_r + 3C_w)}{\lambda}\Bigg)^{2} + \frac{2}{\lambda^{2}}\,\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}.$$
Here, $\Delta w_t = w_t - w^{*}_t$ and $\Delta\rho_t = \rho_t - \rho^{*}_t$; $w^{*}_t$ and $\rho^{*}_t$ are the optimal parameters given by the TD(0) algorithm corresponding to the policy parameter $\theta_t$.
$C_{\alpha}, \sigma$ are constants and $\gamma_t, \alpha_t$ are the step sizes defined in Assumption 3; $\|w_t\| \le C_w$ (Algorithm 2, step 8); $C_r$ is the upper bound on rewards (Assumption 5); the constant $G_{\theta}$ is defined in Lemma 7; $C_g = \frac{L^{2}_{w}}{\lambda}\max_{t}\frac{\gamma^{2}_{t}}{\alpha^{2}_{t}}\, G^{2}_{\theta} + \frac{C^{2}_{\delta}}{\lambda}$ with $C_{\delta} = 2C_r + (4+\eta)C_w$. $\eta$ is the l2-regularisation coefficient from Algorithm 2 and $\eta > \lambda^{all}_{max}$, where $\lambda^{all}_{max}$ is defined in Lemma 11. $\lambda$ is defined in Lemma 11. $L_w$ is defined in Lemma 6.

Proof. The critic update is
$$w_{t+1} = w_t + \alpha_t\,\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{t,i}) - \tilde{\rho}_t + \phi^{\pi}(s'_{t,i})^{\top}\tilde{w}_t - \phi^{\pi}(s_{t,i})^{\top} w_t\big)\,\phi^{\pi}(s_{t,i}) - \alpha_t\eta w_t$$
$$\implies w_{t+1} - w^{*}_{t+1} = \underbrace{(w_t - w^{*}_t) + (w^{*}_t - w^{*}_{t+1})}_{①} + \underbrace{\alpha_t\Big[\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{t,i}) - \rho^{*}_t + \phi^{\pi}(s'_{t,i})^{\top}\tilde{w}_t - \phi^{\pi}(s_{t,i})^{\top} w_t\big)\phi^{\pi}(s_{t,i}) - \eta w_t\Big]}_{②} + \underbrace{\alpha_t\,\frac{1}{M}\sum_{i=0}^{M-1}(\rho^{*}_t - \rho_t)\,\phi^{\pi}(s_{t,i})}_{③} + \underbrace{\alpha_t\,\frac{1}{M}\sum_{i=0}^{M-1}(\rho_t - \tilde{\rho}_t)\,\phi^{\pi}(s_{t,i})}_{④}. \quad (A.10)$$
From (A.10), for ②:
$$\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{t,i}) - \rho^{*}_t + \phi^{\pi}(s'_{t,i})^{\top}\tilde{w}_t - \phi^{\pi}(s_{t,i})^{\top} w_t\big)\phi^{\pi}(s_{t,i}) - \eta w_t = \frac{1}{M}\sum_{i=0}^{M-1}\phi^{\pi}(s_{t,i})\,\phi^{\pi}(s'_{t,i})^{\top}(\tilde{w}_t - w_t) + g(B_t, w_t, \theta_t) - \bar{g}(w_t, \theta_t) + \bar{g}(w_t, \theta_t) - \bar{g}(w^{*}_t, \theta_t), \quad (A.11)$$
where
$$g(B_t, w_t, \theta_t) := \frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{t,i}) - \rho^{*}_t\big)\phi^{\pi}(s_{t,i}) + \Big[\frac{1}{M}\sum_{i=0}^{M-1}\phi^{\pi}(s_{t,i})\big(\phi^{\pi}(s'_{t,i}) - \phi^{\pi}(s_{t,i})\big)^{\top} - \eta I\Big]w_t,$$
$$\bar{g}(w_t, \theta_t) := \int d^{\pi}(s, \pi(\theta_t))\,\phi^{\pi}(s)\Big(R^{\pi}(s) - \rho^{*}_t + \int P^{\pi}(s'|s)\,\phi^{\pi}(s')^{\top} w_t\, ds' - \phi^{\pi}(s)^{\top} w_t\Big)\, ds - \eta w_t.$$
Using (A.11) in (A.10) and collecting terms, let
$$f(B_t, w_t, \theta_t) := \frac{1}{M}\sum_{i=0}^{M-1}(\rho^{*}_t - \rho_t)\phi^{\pi}(s_{t,i}) + \frac{1}{M}\sum_{i=0}^{M-1}(\rho_t - \tilde{\rho}_t)\phi^{\pi}(s_{t,i}) + \frac{1}{M}\sum_{i=0}^{M-1}\phi^{\pi}(s_{t,i})\,\phi^{\pi}(s'_{t,i})^{\top}(\tilde{w}_t - w_t) + g(B_t, w_t, \theta_t) - \bar{g}(w_t, \theta_t) + \bar{g}(w_t, \theta_t) - \bar{g}(w^{*}_t, \theta_t),$$
so that $w_{t+1} - w^{*}_{t+1} = (w_t - w^{*}_t) + (w^{*}_t - w^{*}_{t+1}) + \alpha_t f(B_t, w_t, \theta_t)$. Then
$$\|w_{t+1} - w^{*}_{t+1}\|^{2} = \|w_t - w^{*}_t\|^{2} + \|w^{*}_t - w^{*}_{t+1}\|^{2} + \alpha^{2}_{t}\|f(B_t, w_t, \theta_t)\|^{2} + 2\langle\Delta w_t, w^{*}_t - w^{*}_{t+1}\rangle + 2\alpha_t\langle\Delta w_t, f(B_t, w_t, \theta_t)\rangle + 2\alpha_t\langle w^{*}_t - w^{*}_{t+1}, f(B_t, w_t, \theta_t)\rangle$$
$$\implies \mathbb{E}\|w_{t+1} - w^{*}_{t+1}\|^{2} \le \mathbb{E}\|\Delta w_t\|^{2} + \underbrace{2\,\mathbb{E}\|w^{*}_t - w^{*}_{t+1}\|^{2}}_{①} + \underbrace{2\alpha^{2}_{t}\,\mathbb{E}\|f(B_t, w_t, \theta_t)\|^{2}}_{②} + \underbrace{2\,\mathbb{E}\langle\Delta w_t, w^{*}_t - w^{*}_{t+1}\rangle}_{③} + \underbrace{2\alpha_t\,\mathbb{E}\Big\langle\Delta w_t, \frac{1}{M}\sum_{i}(\rho^{*}_t - \rho_t)\phi^{\pi}(s_{t,i})\Big\rangle}_{④} + \underbrace{2\alpha_t\,\mathbb{E}\Big\langle\Delta w_t, \frac{1}{M}\sum_{i}(\rho_t - \tilde{\rho}_t)\phi^{\pi}(s_{t,i})\Big\rangle}_{⑤} + \underbrace{2\alpha_t\,\mathbb{E}\Big\langle\Delta w_t, \frac{1}{M}\sum_{i}\phi^{\pi}(s_{t,i})\phi^{\pi}(s'_{t,i})^{\top}(\tilde{w}_t - w_t)\Big\rangle}_{⑥} + \underbrace{2\alpha_t\,\mathbb{E}\langle\Delta w_t, g(B_t, w_t, \theta_t) - \bar{g}(w_t, \theta_t)\rangle}_{⑦} + \underbrace{2\alpha_t\,\mathbb{E}\langle\Delta w_t, \bar{g}(w_t, \theta_t) - \bar{g}(w^{*}_t, \theta_t)\rangle}_{⑧}. \quad (A.12)$$
From (A.12):

①: $\mathbb{E}\|w^{*}_t - w^{*}_{t+1}\|^{2} \le L^{2}_{w}\,\mathbb{E}\|\theta_{t+1} - \theta_t\|^{2} \le L^{2}_{w}\gamma^{2}_{t} G^{2}_{\theta}$ (using Lemmas 6 and 7).

②: Using $\|\phi^{\pi}(s)\| \le 1$ (Assumption 4), $|R^{\pi}(s)| \le C_r$ (Assumption 5), $\|w_t\| \le C_w$ (Algorithm 2, step 8), $|\rho_t| \le C_r + 2C_w$ (Lemma 8), $\|\tilde{w}_t\| \le C_w$ (Lemma 9) and $|\tilde{\rho}_t| \le C_r + 2C_w$ (Lemma 10):
$$\mathbb{E}\|f(B_t, w_t, \theta_t)\|^{2} = \mathbb{E}\Big\|\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{t,i}) - \tilde{\rho}_t + \phi^{\pi}(s'_{t,i})^{\top}\tilde{w}_t - \phi^{\pi}(s_{t,i})^{\top} w_t\big)\phi^{\pi}(s_{t,i}) - \eta w_t\Big\|^{2} \le \big(C_r + (C_r + 2C_w) + C_w + C_w + \eta C_w\big)^{2} = C^{2}_{\delta}, \qquad C_{\delta} = 2C_r + (4+\eta)C_w.$$

③: $\mathbb{E}\langle\Delta w_t, w^{*}_t - w^{*}_{t+1}\rangle \le \mathbb{E}\big[\|\Delta w_t\|\,\|w^{*}_t - w^{*}_{t+1}\|\big] \le L_w\,\mathbb{E}\big[\|\Delta w_t\|\,\|\theta_{t+1} - \theta_t\|\big]$ (using Lemma 6).

④: $\mathbb{E}\Big\langle\Delta w_t, \frac{1}{M}\sum_{i}(\rho^{*}_t - \rho_t)\phi^{\pi}(s_{t,i})\Big\rangle \le \mathbb{E}\big[\|\Delta w_t\|\,|\rho^{*}_t - \rho_t|\big] = \mathbb{E}\big[\|\Delta w_t\|\,|\Delta\rho_t|\big]$.

⑤:
$$\mathbb{E}\Big\langle\Delta w_t, \frac{1}{M}\sum_{i}(\rho_t - \tilde{\rho}_t)\phi^{\pi}(s_{t,i})\Big\rangle \le \mathbb{E}\big[\|\Delta w_t\|\,|\rho_t - \tilde{\rho}_t|\big] \le \mathbb{E}\big[\|\Delta w_t\|(|\rho_t| + |\tilde{\rho}_t|)\big] \le 2(C_r + 2C_w)\,\mathbb{E}\|\Delta w_t\| \quad (\text{using Lemmas 8 and 10}).$$

⑥:
$$\mathbb{E}\Big\langle\Delta w_t, \frac{1}{M}\sum_{i}\phi^{\pi}(s_{t,i})\,\phi^{\pi}(s'_{t,i})^{\top}(\tilde{w}_t - w_t)\Big\rangle \le \mathbb{E}\big[\|\Delta w_t\|\,\|\tilde{w}_t - w_t\|\big] \le 2C_w\,\mathbb{E}\|\Delta w_t\| \quad (\text{using Algorithm 2}).$$

⑦: $\mathbb{E}\langle\Delta w_t, g(B_t, w_t, \theta_t) - \bar{g}(w_t, \theta_t)\rangle = \mathbb{E}\langle\Delta w_t, \mathbb{E}[g(B_t, w_t, \theta_t) - \bar{g}(w_t, \theta_t) \mid \Delta w_t]\rangle = 0$, since $\mathbb{E}[g(B_t, w_t, \theta_t) - \bar{g}(w_t, \theta_t)] = 0$.

⑧: Writing $\bar{g}(w, \theta_t) = b(\theta_t) + A(\theta_t)w$ with
$$A(\theta_t) = \int_{S} d^{\pi}(s, \theta_t)\Big(\phi^{\pi}(s)\big(\mathbb{E}[\phi^{\pi}(s')] - \phi^{\pi}(s)\big)^{\top} - \eta I\Big)\, ds, \qquad b(\theta_t) = \int_{S} d^{\pi}(s, \theta_t)\,\big(r^{\pi}(s) - \rho^{*}_t\big)\phi^{\pi}(s)\, ds,$$
we get $\bar{g}(w_t, \theta_t) - \bar{g}(w^{*}_t, \theta_t) = A(\theta_t)(w_t - w^{*}_t)$, and therefore
$$\mathbb{E}\langle\Delta w_t, \bar{g}(w_t, \theta_t) - \bar{g}(w^{*}_t, \theta_t)\rangle = \mathbb{E}\big[\Delta w^{\top}_t A(\theta_t)\Delta w_t\big] \le -\lambda\,\mathbb{E}\|\Delta w_t\|^{2} \quad (\text{Lemma 11}).$$
Combining ①-⑧ in (A.12) and using $\|\theta_{t+1} - \theta_t\| \le \gamma_t G_{\theta}$ (Lemma 7):
$$\mathbb{E}\|\Delta w_{t+1}\|^{2} \le (1 - 2\lambda\alpha_t)\,\mathbb{E}\|\Delta w_t\|^{2} + 2L^{2}_{w}\gamma^{2}_{t} G^{2}_{\theta} + 2\alpha^{2}_{t} C^{2}_{\delta} + 2L_w\gamma_t G_{\theta}\,\mathbb{E}\|\Delta w_t\| + 2\alpha_t\,\mathbb{E}\big[\|\Delta w_t\|\,|\Delta\rho_t|\big] + 4\alpha_t(C_r + 3C_w)\,\mathbb{E}\|\Delta w_t\|$$
$$\implies \mathbb{E}\|\Delta w_t\|^{2} \le \frac{1}{2\lambda\alpha_t}\big(\mathbb{E}\|\Delta w_t\|^{2} - \mathbb{E}\|\Delta w_{t+1}\|^{2}\big) + \frac{L^{2}_{w}\gamma^{2}_{t}}{\lambda\alpha_t} G^{2}_{\theta} + \frac{\alpha_t}{\lambda} C^{2}_{\delta} + \frac{L_w}{\lambda}\frac{\gamma_t}{\alpha_t} G_{\theta}\,\mathbb{E}\|\Delta w_t\| + \frac{\mathbb{E}[\|\Delta w_t\|\,|\Delta\rho_t|]}{\lambda} + \frac{2(C_r + 3C_w)}{\lambda}\,\mathbb{E}\|\Delta w_t\|$$
$$\implies \sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2} \le \underbrace{\sum_{t=0}^{T-1}\frac{1}{2\lambda\alpha_t}\big(\mathbb{E}\|\Delta w_t\|^{2} - \mathbb{E}\|\Delta w_{t+1}\|^{2}\big)}_{①} + \underbrace{\sum_{t=0}^{T-1}\Big(\frac{L^{2}_{w}\gamma^{2}_{t}}{\lambda\alpha_t} G^{2}_{\theta} + \frac{\alpha_t}{\lambda} C^{2}_{\delta}\Big)}_{②} + \underbrace{\sum_{t=0}^{T-1}\frac{L_w}{\lambda}\frac{\gamma_t}{\alpha_t} G_{\theta}\,\mathbb{E}\|\Delta w_t\|}_{③} + \underbrace{\sum_{t=0}^{T-1}\frac{\mathbb{E}[\|\Delta w_t\|\,|\Delta\rho_t|]}{\lambda}}_{④} + \underbrace{\sum_{t=0}^{T-1}\frac{2(C_r + 3C_w)}{\lambda}\,\mathbb{E}\|\Delta w_t\|}_{⑤}. \quad (A.13)$$
From (A.13):

①: Using $\|\Delta w_t\| \le 2C_w$ and summation by parts:
$$\frac{1}{2\lambda}\sum_{t=0}^{T-1}\frac{1}{\alpha_t}\big(\mathbb{E}\|\Delta w_t\|^{2} - \mathbb{E}\|\Delta w_{t+1}\|^{2}\big) \le \frac{1}{2\lambda}\Big[\sum_{t=1}^{T-1}\Big(\frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}}\Big) + \frac{1}{\alpha_0}\Big]4C^{2}_{w} \le \frac{4C^{2}_{w}}{2\lambda\alpha_{T-1}} = \frac{2C^{2}_{w}}{\lambda C_{\alpha}}\, T^{\sigma} \quad \Big(\because \alpha_t = \frac{C_{\alpha}}{(1+t)^{\sigma}}\Big).$$

②:
$$\sum_{t=0}^{T-1}\Big(\frac{L^{2}_{w}}{\lambda}\frac{\gamma^{2}_{t}}{\alpha_t} G^{2}_{\theta} + \frac{\alpha_t}{\lambda} C^{2}_{\delta}\Big) = \sum_{t=0}^{T-1}\Big(\frac{L^{2}_{w}}{\lambda}\frac{\gamma^{2}_{t}}{\alpha^{2}_{t}} G^{2}_{\theta} + \frac{C^{2}_{\delta}}{\lambda}\Big)\alpha_t \le \sum_{t=0}^{T-1} C_g\,\alpha_t \le \frac{C_g C_{\alpha}}{1-\sigma}\, T^{1-\sigma}, \qquad C_g = \frac{L^{2}_{w}}{\lambda}\max_{t}\frac{\gamma^{2}_{t}}{\alpha^{2}_{t}}\, G^{2}_{\theta} + \frac{C^{2}_{\delta}}{\lambda}.$$

③: By the Cauchy-Schwarz and Jensen inequalities:
$$\sum_{t=0}^{T-1}\frac{L_w}{\lambda}\frac{\gamma_t}{\alpha_t} G_{\theta}\,\mathbb{E}\|\Delta w_t\| \le \frac{L_w G_{\theta}}{\lambda}\Big(\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2}\Big(\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}\Big)^{1/2}.$$

④:
$$\frac{1}{\lambda}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\Delta w_t\|\,|\Delta\rho_t|\big] \le \frac{1}{\lambda}\Big(\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}\Big)^{1/2}\Big(\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}\Big)^{1/2}.$$

⑤:
$$\frac{2(C_r + 3C_w)}{\lambda}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\| \le \frac{2(C_r + 3C_w)}{\lambda}\, T^{1/2}\Big(\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}\Big)^{1/2}.$$
Combining ①-⑤ in (A.13) and dividing by $T$, with $M(T) = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}$ and $N(T) = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}$:
$$M(T) \le K_1 + K_2\sqrt{M(T)} + K_3\sqrt{M(T)}\sqrt{N(T)},$$
$$K_1 := \frac{2C^{2}_{w}}{\lambda C_{\alpha}} T^{\sigma-1} + \frac{C_g C_{\alpha}}{1-\sigma} T^{-\sigma}, \qquad K_2 := \frac{L_w G_{\theta}}{\lambda}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + \frac{2(C_r + 3C_w)}{\lambda}, \qquad K_3 := \frac{1}{\lambda}.$$
Completing the square:
$$\Big(\sqrt{M(T)} - \frac{K_2}{2} - \frac{K_3}{2}\sqrt{N(T)}\Big)^{2} \le K_1 + \Big(\frac{K_2}{2} + \frac{K_3}{2}\sqrt{N(T)}\Big)^{2} \implies \sqrt{M(T)} \le \sqrt{K_1} + K_2 + K_3\sqrt{N(T)} \implies M(T) \le 2\big(\sqrt{K_1} + K_2\big)^{2} + 2K^{2}_{3}\, N(T).$$
Hence
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2} \le 2\Bigg(\sqrt{\frac{2C^{2}_{w}}{\lambda C_{\alpha}} T^{\sigma-1} + \frac{C_g C_{\alpha}}{1-\sigma} T^{-\sigma}} + \frac{L_w G_{\theta}}{\lambda}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + \frac{2(C_r + 3C_w)}{\lambda}\Bigg)^{2} + \frac{2}{\lambda^{2}}\,\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}.$$

Lemma 5. Let the cumulative error of the linear critic be $\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}$ and the cumulative error of the average reward estimator be $\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}$, where $w_t$ and $\rho_t$ are the linear critic parameter and the average reward estimator at time $t$, respectively. The cumulative error of the average reward estimator is bounded using the cumulative error of the critic as follows:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2} \le 2\Bigg(\sqrt{\frac{2(C_r + 2C_w)^{2}}{C_{\alpha}} T^{\sigma-1} + \frac{C_s C_{\alpha}}{1-\sigma} T^{-\sigma}} + L_p G_{\theta}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + 4C_w\Bigg)^{2} + \frac{8}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}.$$
Here, $\Delta\rho_t = \rho_t - \rho^{*}_t$ and $\Delta w_t = w_t - w^{*}_t$; $w^{*}_t$ and $\rho^{*}_t$ are the optimal parameters given by the TD(0) algorithm corresponding to the policy parameter $\theta_t$. $C_{\alpha}, \sigma$ are constants and $\gamma_t, \alpha_t$ are the step sizes defined in Assumption 3; $\|w_t\| \le C_w$ (Algorithm 2, step 8); $C_r$ is the upper bound on rewards (Assumption 5); the constant $G_{\theta}$ is defined in Lemma 7; $C_s = L^{2}_{p} G^{2}_{\theta}\max_{t}\frac{\gamma^{2}_{t}}{\alpha^{2}_{t}} + 4(C_r + 2C_w)^{2}$; $L_p$ is the Lipschitz constant defined in Lemma 14.

Proof.
The average reward estimator update is
$$\rho_{t+1} = \rho_t + \alpha_t\,\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{t,i}) - \rho_t + \phi^{\pi}(s'_{t,i})^{\top}\tilde{w}_t - \phi^{\pi}(s_{t,i})^{\top}\tilde{w}_t\big)$$
$$\implies \rho_{t+1} - \rho^{*}_{t+1} = (\rho_t - \rho^{*}_t) + (\rho^{*}_t - \rho^{*}_{t+1}) + \alpha_t(\rho^{*}_t - \rho_t) + \alpha_t\,\frac{1}{M}\sum_{i=0}^{M-1}\big(\phi^{\pi}(s'_{t,i}) - \phi^{\pi}(s_{t,i})\big)^{\top}(\tilde{w}_t - w_t) + \alpha_t\big(l(B_t, w_t, \theta_t) - \bar{l}(w_t, \theta_t)\big) + \alpha_t\big(\bar{l}(w_t, \theta_t) - \bar{l}(w^{*}_t, \theta_t)\big),$$
where
$$l(B_t, w_t, \theta_t) := \frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{t,i}) - \rho^{*}_t + \phi^{\pi}(s'_{t,i})^{\top} w_t - \phi^{\pi}(s_{t,i})^{\top} w_t\big),$$
$$\bar{l}(w_t, \theta_t) := \int_{S} d^{\pi}(s, \pi(\theta_t))\big(R^{\pi}(s) - \rho(\pi(\theta_t)) + \phi^{\pi}(s')^{\top} w_t - \phi^{\pi}(s)^{\top} w_t\big)\, ds,$$
$$\tilde{l}(B_t, \rho_t, \tilde{w}_t, \theta_t) := (\rho^{*}_t - \rho_t) + \frac{1}{M}\sum_{i=0}^{M-1}\big(\phi^{\pi}(s'_{t,i}) - \phi^{\pi}(s_{t,i})\big)^{\top}(\tilde{w}_t - w_t) + l(B_t, w_t, \theta_t) - \bar{l}(w_t, \theta_t) + \bar{l}(w_t, \theta_t) - \bar{l}(w^{*}_t, \theta_t),$$
so that $\Delta\rho_{t+1} = \Delta\rho_t + (\rho^{*}_t - \rho^{*}_{t+1}) + \alpha_t\,\tilde{l}(B_t, \rho_t, \tilde{w}_t, \theta_t)$. Then
$$\mathbb{E}|\Delta\rho_{t+1}|^{2} \le \mathbb{E}|\Delta\rho_t|^{2} + \underbrace{2\,\mathbb{E}|\rho^{*}_t - \rho^{*}_{t+1}|^{2}}_{①} + \underbrace{2\alpha^{2}_{t}\,\mathbb{E}|\tilde{l}(B_t, \rho_t, \tilde{w}_t, \theta_t)|^{2}}_{②} + \underbrace{2\,\mathbb{E}\langle\Delta\rho_t, \rho^{*}_t - \rho^{*}_{t+1}\rangle}_{③} + \underbrace{2\alpha_t\,\mathbb{E}\langle\Delta\rho_t, -\Delta\rho_t\rangle}_{④} + \underbrace{2\alpha_t\,\mathbb{E}\Big\langle\Delta\rho_t, \frac{1}{M}\sum_{i}\big(\phi^{\pi}(s'_{t,i}) - \phi^{\pi}(s_{t,i})\big)^{\top}(\tilde{w}_t - w_t)\Big\rangle}_{⑤} + \underbrace{2\alpha_t\,\mathbb{E}\langle\Delta\rho_t, l(B_t, w_t, \theta_t) - \bar{l}(w_t, \theta_t)\rangle}_{⑥} + \underbrace{2\alpha_t\,\mathbb{E}\langle\Delta\rho_t, \bar{l}(w_t, \theta_t) - \bar{l}(w^{*}_t, \theta_t)\rangle}_{⑦}. \quad (A.14)$$
From (A.14):

①: $\mathbb{E}|\rho^{*}_t - \rho^{*}_{t+1}|^{2} \le L^{2}_{p}\,\mathbb{E}\|\theta_{t+1} - \theta_t\|^{2}$ (Lemma 14).

②:
$$\mathbb{E}|\tilde{l}(B_t, \rho_t, \tilde{w}_t, \theta_t)|^{2} = \mathbb{E}\Big|\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{t,i}) - \rho_t + (\phi^{\pi}(s'_{t,i}) - \phi^{\pi}(s_{t,i}))^{\top}\tilde{w}_t\big)\Big|^{2} \le \big(C_r + (C_r + 2C_w) + 2C_w\big)^{2} = 4(C_r + 2C_w)^{2}.$$

③: $\mathbb{E}\langle\Delta\rho_t, \rho^{*}_t - \rho^{*}_{t+1}\rangle \le \mathbb{E}\big[|\Delta\rho_t|\,|\rho^{*}_t - \rho^{*}_{t+1}|\big] \le L_p\,\mathbb{E}\big[|\Delta\rho_t|\,\|\theta_{t+1} - \theta_t\|\big]$.

④: $\mathbb{E}\langle\Delta\rho_t, -\Delta\rho_t\rangle = -\mathbb{E}|\Delta\rho_t|^{2}$.

⑤:
$$\mathbb{E}\Big\langle\Delta\rho_t, \frac{1}{M}\sum_{i}\big(\phi^{\pi}(s'_{t,i}) - \phi^{\pi}(s_{t,i})\big)^{\top}(\tilde{w}_t - w_t)\Big\rangle \le \mathbb{E}\Big[\frac{1}{M}\sum_{i}\|\phi^{\pi}(s'_{t,i}) - \phi^{\pi}(s_{t,i})\|\,\|\tilde{w}_t - w_t\|\,|\Delta\rho_t|\Big] \le 4C_w\,\mathbb{E}|\Delta\rho_t|.$$

⑥: $\mathbb{E}\langle\Delta\rho_t, l(B_t, w_t, \theta_t) - \bar{l}(w_t, \theta_t)\rangle = \mathbb{E}\langle\Delta\rho_t, \mathbb{E}[l(B_t, w_t, \theta_t) - \bar{l}(w_t, \theta_t) \mid \Delta\rho_t]\rangle = 0$, since $\mathbb{E}[l(B_t, w_t, \theta_t) - \bar{l}(w_t, \theta_t)] = 0$.

⑦:
$$\mathbb{E}\langle\Delta\rho_t, \bar{l}(w_t, \theta_t) - \bar{l}(w^{*}_t, \theta_t)\rangle = \mathbb{E}\big[\big(\mathbb{E}[\phi^{\pi}(s')] - \phi^{\pi}(s)\big)^{\top}(w_t - w^{*}_t)\,\Delta\rho_t\big] \le \mathbb{E}\big[\|\mathbb{E}[\phi^{\pi}(s')] - \phi^{\pi}(s)\|\,\|\Delta w_t\|\,|\Delta\rho_t|\big] \le 2\,\mathbb{E}\big[\|\Delta w_t\|\,|\Delta\rho_t|\big].$$
Combining ①-⑦ in (A.14):
$$\mathbb{E}|\Delta\rho_{t+1}|^{2} \le (1 - 2\alpha_t)\,\mathbb{E}|\Delta\rho_t|^{2} + 2L^{2}_{p}\,\mathbb{E}\|\theta_{t+1} - \theta_t\|^{2} + 8\alpha^{2}_{t}(C_r + 2C_w)^{2} + 2L_p\,\mathbb{E}\big[|\Delta\rho_t|\,\|\theta_{t+1} - \theta_t\|\big] + 8\alpha_t C_w\,\mathbb{E}|\Delta\rho_t| + 4\alpha_t\,\mathbb{E}\big[\|\Delta w_t\|\,|\Delta\rho_t|\big]$$
$$\implies \sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2} \le \underbrace{\sum_{t=0}^{T-1}\frac{1}{2\alpha_t}\big(\mathbb{E}|\Delta\rho_t|^{2} - \mathbb{E}|\Delta\rho_{t+1}|^{2}\big)}_{①} + \underbrace{\sum_{t=0}^{T-1}\Big(\frac{L^{2}_{p}\gamma^{2}_{t}}{\alpha_t} G^{2}_{\theta} + 4\alpha_t(C_r + 2C_w)^{2}\Big)}_{②} + \underbrace{\sum_{t=0}^{T-1}\Big(\frac{L_p\gamma_t}{\alpha_t} G_{\theta} + 4C_w\Big)\mathbb{E}|\Delta\rho_t|}_{③} + \underbrace{\sum_{t=0}^{T-1} 2\,\mathbb{E}\big[\|\Delta w_t\|\,|\Delta\rho_t|\big]}_{④}. \quad (A.15)$$
From (A.15):

①: Using $|\Delta\rho_t| \le 2(C_r + 2C_w)$ and summation by parts:
$$\frac{1}{2}\sum_{t=0}^{T-1}\frac{1}{\alpha_t}\big(\mathbb{E}|\Delta\rho_t|^{2} - \mathbb{E}|\Delta\rho_{t+1}|^{2}\big) \le \frac{1}{2}\Big[\sum_{t=1}^{T-1}\Big(\frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}}\Big) + \frac{1}{\alpha_0}\Big]4(C_r + 2C_w)^{2} \le \frac{2(C_r + 2C_w)^{2}}{C_{\alpha}}\, T^{\sigma}.$$

②:
$$\sum_{t=0}^{T-1}\Big(L^{2}_{p} G^{2}_{\theta}\frac{\gamma^{2}_{t}}{\alpha_t} + 4\alpha_t(C_r + 2C_w)^{2}\Big) \le \sum_{t=0}^{T-1}\Big(L^{2}_{p} G^{2}_{\theta}\max_{t}\frac{\gamma^{2}_{t}}{\alpha^{2}_{t}} + 4(C_r + 2C_w)^{2}\Big)\alpha_t = \sum_{t=0}^{T-1} C_s\,\alpha_t \le \frac{C_s C_{\alpha}}{1-\sigma}\, T^{1-\sigma}, \qquad C_s = L^{2}_{p} G^{2}_{\theta}\max_{t}\frac{\gamma^{2}_{t}}{\alpha^{2}_{t}} + 4(C_r + 2C_w)^{2}.$$

③: By the Cauchy-Schwarz inequality:
$$\sum_{t=0}^{T-1}\Big(L_p G_{\theta}\frac{\gamma_t}{\alpha_t} + 4C_w\Big)\mathbb{E}|\Delta\rho_t| \le L_p G_{\theta}\Big(\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2}\Big(\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}\Big)^{1/2} + 4C_w\, T^{1/2}\Big(\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}\Big)^{1/2}.$$
④: By the Cauchy-Schwarz inequality:
$$2\sum_{t=0}^{T-1}\mathbb{E}\big[\|\Delta w_t\|\,|\Delta\rho_t|\big] \le 2\Big(\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}\Big)^{1/2}\Big(\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}\Big)^{1/2}.$$
Combining ①-④ in (A.15) and dividing by $T$, with $M(T) = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2}$ and $N(T) = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}$:
$$M(T) \le K_1 + K_2\sqrt{M(T)} + K_3\sqrt{M(T)}\sqrt{N(T)},$$
where
$$K_1 = \frac{2(C_r + 2C_w)^{2}}{C_{\alpha}} T^{\sigma-1} + \frac{C_s C_{\alpha}}{1-\sigma} T^{-\sigma}, \qquad K_2 = L_p G_{\theta}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + 4C_w, \qquad K_3 = 2.$$
From Lemma 4 we know that $M(T) \le 2(\sqrt{K_1} + K_2)^{2} + 2K^{2}_{3}\, N(T)$. Hence
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}|\Delta\rho_t|^{2} \le 2\Bigg(\sqrt{\frac{2(C_r + 2C_w)^{2}}{C_{\alpha}} T^{\sigma-1} + \frac{C_s C_{\alpha}}{1-\sigma} T^{-\sigma}} + L_p G_{\theta}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + 4C_w\Bigg)^{2} + \frac{8}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}.$$

Theorem 3. The on-policy average reward actor-critic algorithm obtains an $\epsilon$-accurate stationary point with a sample complexity of $\Omega(\epsilon^{-2.5})$:
$$\min_{0 \le t \le T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} = O\Big(\frac{1}{T^{0.4}}\Big) + O(1), \qquad \min_{0 \le t \le T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} \le \epsilon + O(1).$$

Proof.
Using Lemma 4 and Lemma 5 we obtain
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2} \le 2\Bigg(\sqrt{\frac{2C^{2}_{w}}{\lambda C_{\alpha}} T^{\sigma-1} + \frac{C_g C_{\alpha}}{1-\sigma} T^{-\sigma}} + \frac{L_w G_{\theta}}{\lambda}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + \frac{2(C_r + 3C_w)}{\lambda}\Bigg)^{2} + \frac{4}{\lambda^{2}}\Bigg(\sqrt{\frac{2(C_r + 2C_w)^{2}}{C_{\alpha}} T^{\sigma-1} + \frac{C_s C_{\alpha}}{1-\sigma} T^{-\sigma}} + L_p G_{\theta}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + 4C_w\Bigg)^{2} + \frac{16}{\lambda^{2}}\,\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2}$$
$$\implies \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2} \le \frac{2\lambda^{2}}{\lambda^{2} - 16}\underbrace{\Bigg(\sqrt{\frac{2C^{2}_{w}}{\lambda C_{\alpha}} T^{\sigma-1} + \frac{C_g C_{\alpha}}{1-\sigma} T^{-\sigma}} + \frac{L_w G_{\theta}}{\lambda}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + \frac{2(C_r + 3C_w)}{\lambda}\Bigg)^{2}}_{①} + \frac{4}{\lambda^{2} - 16}\underbrace{\Bigg(\sqrt{\frac{2(C_r + 2C_w)^{2}}{C_{\alpha}} T^{\sigma-1} + \frac{C_s C_{\alpha}}{1-\sigma} T^{-\sigma}} + L_p G_{\theta}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2}\Big)^{1/2} + 4C_w\Bigg)^{2}}_{②}. \quad (A.16)$$
From (A.16), for ①: since $\gamma_t = \frac{C_{\gamma}}{(1+t)^{v}}$ and $\alpha_t = \frac{C_{\alpha}}{(1+t)^{\sigma}}$,
$$\frac{1}{T}\sum_{t=0}^{T-1}\Big(\frac{\gamma_t}{\alpha_t}\Big)^{2} \le \Big(\frac{C_{\gamma}}{C_{\alpha}}\Big)^{2}\frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{(1+t)^{2(v-\sigma)}} \le \Big(\frac{C_{\gamma}}{C_{\alpha}}\Big)^{2}\frac{T^{-2(v-\sigma)}}{1 - 2(v-\sigma)} \qquad \Big(\because \sum_{t=0}^{T-1}\frac{1}{(1+t)^{v}} \le \int_{0}^{T}\frac{dt}{t^{v}} = \frac{T^{1-v}}{1-v}\Big),$$
so, using $(a+b+c)^{2} \le 3(a^{2}+b^{2}+c^{2})$,
$$① \le 3\Bigg(\frac{2C^{2}_{w}}{\lambda C_{\alpha}} T^{\sigma-1} + \frac{C_g C_{\alpha}}{1-\sigma} T^{-\sigma} + \Big(\frac{L_w G_{\theta}}{\lambda}\Big)^{2}\Big(\frac{C_{\gamma}}{C_{\alpha}}\Big)^{2}\frac{T^{-2(v-\sigma)}}{1 - 2(v-\sigma)} + \Big(\frac{2(C_r + 3C_w)}{\lambda}\Big)^{2}\Bigg) = O\Big(\frac{1}{T^{1-\sigma}}\Big) + O\Big(\frac{1}{T^{\sigma}}\Big) + O\Big(\frac{1}{T^{2(v-\sigma)}}\Big) + O(1).$$
② is similar to ①:
$$② = O\Big(\frac{1}{T^{1-\sigma}}\Big) + O\Big(\frac{1}{T^{\sigma}}\Big) + O\Big(\frac{1}{T^{2(v-\sigma)}}\Big) + O(1).$$
Combining ① and ②:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^{2} = O\Big(\frac{1}{T^{1-\sigma}}\Big) + O\Big(\frac{1}{T^{\sigma}}\Big) + O\Big(\frac{1}{T^{2(v-\sigma)}}\Big) + O(1). \quad (A.17)$$
Using Lemma 3 and (A.17):
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} = O\Big(\frac{1}{T^{1-v}}\Big) + O\Big(\frac{1}{T^{v}}\Big) + O\Big(\frac{1}{T^{1-\sigma}}\Big) + O\Big(\frac{1}{T^{\sigma}}\Big) + O\Big(\frac{1}{T^{2(v-\sigma)}}\Big) + O(1)$$
$$\implies \min_{0 \le t \le T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} = O\Big(\frac{1}{T^{1-v}}\Big) + O\Big(\frac{1}{T^{v}}\Big) + O\Big(\frac{1}{T^{1-\sigma}}\Big) + O\Big(\frac{1}{T^{\sigma}}\Big) + O\Big(\frac{1}{T^{2(v-\sigma)}}\Big) + O(1) \qquad \Big(\because \min_{t}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} \le \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2}\Big).$$
By setting $v = 3/5$ and $\sigma = 2/5$, we obtain:
$$\min_{0 \le t \le T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^{2} = O\Big(\frac{1}{T^{0.4}}\Big) + O(1).$$
Requiring $O(1/T^{0.4}) \le \epsilon$ gives $T = \Omega(\epsilon^{-2.5})$. Hence, the sample complexity of the on-policy average reward actor-critic algorithm is $\Omega(\epsilon^{-2.5})$.

Lemma 6. The optimal critic parameter $w(\theta_t)^{*}$ as a function of the actor parameter $\theta_t$ is Lipschitz continuous with constant $L_w$ (here $w^{*}_t := w(\theta_t)^{*}$):
$$\|w^{*}_t - w^{*}_{t+1}\| \le L_w\|\theta_{t+1} - \theta_t\|.$$

Proof. $\eta$ is the l2-regularisation coefficient from Algorithm 2 and $\eta > \lambda^{all}_{max}$, where $\lambda^{all}_{max}$ is defined in Lemma 11. Because of this careful choice of $\eta$, $A(\theta_t)$ is negative definite, hence invertible. Thus, for on-policy TD(0) with l2-regularization and target estimators, the optimal critic parameter $w^{*}_t$ satisfies
$$\mathbb{E}\big[(R^{\pi}(s) - \rho^{*}_t)\phi^{\pi}(s) + \big(\phi^{\pi}(s)(\mathbb{E}[\phi^{\pi}(s')] - \phi^{\pi}(s))^{\top} - \eta I\big)w^{*}_t\big] = 0.$$
Defining $b(\theta_t) := \mathbb{E}[(R^{\pi}(s) - \rho^{*}_t)\phi^{\pi}(s)]$ and $A(\theta_t) := \mathbb{E}[\phi^{\pi}(s)(\mathbb{E}[\phi^{\pi}(s')] - \phi^{\pi}(s))^{\top} - \eta I]$, we get $b(\theta_t) + A(\theta_t)w^{*}_t = 0$, i.e., $w^{*}_t = -A(\theta_t)^{-1} b(\theta_t)$. Therefore
$$\|w^{*}_t - w^{*}_{t+1}\| = \|A(\theta_t)^{-1} b(\theta_t) - A(\theta_{t+1})^{-1} b(\theta_{t+1})\| \le \underbrace{\|A(\theta_t)^{-1} - A(\theta_{t+1})^{-1}\|\,\|b(\theta_t)\|}_{①} + \underbrace{\|A(\theta_{t+1})^{-1}\|\,\|b(\theta_t) - b(\theta_{t+1})\|}_{②}. \quad (A.18)$$
From (A.18), for ①:
$$\|A(\theta_t)^{-1} - A(\theta_{t+1})^{-1}\| = \|A(\theta_t)^{-1}\big(A(\theta_{t+1}) - A(\theta_t)\big)A(\theta_{t+1})^{-1}\| \le \|A(\theta_t)^{-1}\|\,\|A(\theta_t) - A(\theta_{t+1})\|\,\|A(\theta_{t+1})^{-1}\|. \quad (A.19)$$
In what follows, $\pi'$ and $\pi$ denote the policies with parameters $\theta_{t+1}$ and $\theta_t$, respectively.
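As a quick numerical sanity check of the fixed-point identity $w^{*} = -A(\theta)^{-1} b(\theta)$ used above, the quantities can be instantiated on a made-up 3-state chain. This is our own illustration, not the paper's code: the chain, rewards and features are random stand-ins, and $\eta$ is simply chosen large enough that $A$ is comfortably negative definite.

```python
import numpy as np

rng = np.random.default_rng(2)
S, d = 3, 2
P = rng.uniform(size=(S, S)); P /= P.sum(axis=1, keepdims=True)   # transition matrix
r = rng.uniform(-1, 1, S)                                          # rewards
Phi = rng.uniform(-1, 1, (S, d))                                   # feature matrix

# stationary distribution d_pi and average reward rho*
evals, evecs = np.linalg.eig(P.T)
dpi = np.abs(np.real(evecs[:, np.argmax(np.real(evals))])); dpi /= dpi.sum()
rho_star = dpi @ r

eta = 5.0   # l2 regularization, large enough here to make A negative definite
A = Phi.T @ np.diag(dpi) @ (P @ Phi - Phi) - eta * np.eye(d)
b = Phi.T @ (dpi * (r - rho_star))
w_star = -np.linalg.solve(A, b)

# the expected regularized TD(0) update vanishes at w* = -A^{-1} b
g = Phi.T @ (dpi * (r - rho_star + (P @ Phi - Phi) @ w_star)) - eta * w_star
assert np.allclose(g, 0)

# and A is negative definite, so this fixed point is unique
assert np.linalg.eigvalsh(A + A.T).max() < 0
```

The same construction with two nearby parameter settings would exhibit the Lipschitz behaviour of $w^{*}$ in $\theta$ that the lemma formalizes.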
To bound $\|A(\theta_t) - A(\theta_{t+1})\|$ in (A.19):
$$\|A(\theta_t) - A(\theta_{t+1})\| = \Big\|\int d^{\pi}(s)\Big(\phi^{\pi}(s)\Big(\int P^{\pi}(s'|s)\phi^{\pi}(s')\,ds' - \phi^{\pi}(s)\Big)^{\top} - \eta I\Big)\, ds - \int d^{\pi'}(s)\Big(\phi^{\pi'}(s)\Big(\int P^{\pi'}(s'|s)\phi^{\pi'}(s')\,ds' - \phi^{\pi'}(s)\Big)^{\top} - \eta I\Big)\, ds\Big\|$$
$$\le \underbrace{\Big\|\int d^{\pi'}(s)\,\phi^{\pi'}(s)\Big(\int P^{\pi'}(s'|s)\phi^{\pi'}(s')\,ds'\Big)^{\top}\, ds - \int d^{\pi}(s)\,\phi^{\pi}(s)\Big(\int P^{\pi}(s'|s)\phi^{\pi}(s')\,ds'\Big)^{\top}\, ds\Big\|}_{①} + \underbrace{\Big\|\int d^{\pi}(s)\,\phi^{\pi}(s)\phi^{\pi}(s)^{\top}\, ds - \int d^{\pi'}(s)\,\phi^{\pi'}(s)\phi^{\pi'}(s)^{\top}\, ds\Big\|}_{②}. \quad (A.20)$$
From (A.20), ①:
$$① \le \Big\|\int\big(d^{\pi'}(s) - d^{\pi}(s)\big)\phi^{\pi'}(s)\Big(\int P^{\pi'}(s'|s)\phi^{\pi'}(s')\,ds'\Big)^{\top}\, ds\Big\| + \Big\|\int d^{\pi}(s)\big(\phi^{\pi'}(s) - \phi^{\pi}(s)\big)\Big(\int P^{\pi'}(s'|s)\phi^{\pi'}(s')\,ds'\Big)^{\top}\, ds\Big\| + \Big\|\int d^{\pi}(s)\,\phi^{\pi}(s)\Big(\int\big(P^{\pi'}(s'|s) - P^{\pi}(s'|s)\big)\phi^{\pi'}(s')\,ds'\Big)^{\top}\, ds\Big\| + \Big\|\int d^{\pi}(s)\,\phi^{\pi}(s)\Big(\int P^{\pi}(s'|s)\big(\phi^{\pi'}(s') - \phi^{\pi}(s')\big)\,ds'\Big)^{\top}\, ds\Big\|$$
$$\le L_d\|\theta_{t+1} - \theta_t\| \ (\text{Lemma 12}) + L_{\phi}\|\theta_{t+1} - \theta_t\| \ (\text{Assumption 8}) + L_t\|\theta_{t+1} - \theta_t\| \ (\text{Assumption 9}) + L_{\phi}\|\theta_{t+1} - \theta_t\| \ (\text{Assumption 8})$$
$$\implies ① \le (L_d + L_t + 2L_{\phi})\,\|\theta_{t+1} - \theta_t\|. \quad (A.21)$$
From (A.20), ②:
$$② \le \Big\|\int\big(d^{\pi}(s) - d^{\pi'}(s)\big)\phi^{\pi}(s)\phi^{\pi}(s)^{\top}\, ds\Big\| + \Big\|\int d^{\pi'}(s)\big(\phi^{\pi}(s) - \phi^{\pi'}(s)\big)\phi^{\pi}(s)^{\top}\, ds\Big\| + \Big\|\int d^{\pi'}(s)\,\phi^{\pi'}(s)\big(\phi^{\pi}(s) - \phi^{\pi'}(s)\big)^{\top}\, ds\Big\| \le (L_d + 2L_{\phi})\,\|\theta_{t+1} - \theta_t\|. \quad (A.22)$$
Using (A.21) and (A.22) in (A.20):
$$\|A(\theta_t) - A(\theta_{t+1})\| \le (2L_d + 4L_{\phi} + L_t)\,\|\theta_{t+1} - \theta_t\|. \quad (A.23)$$
From (A.18), for ②:
$$\|b(\theta_t) - b(\theta_{t+1})\| = \Big\|\int d^{\pi'}(s)\big(R^{\pi'}(s) - \rho^{*}_{t+1}\big)\phi^{\pi'}(s)\, ds - \int d^{\pi}(s)\big(R^{\pi}(s) - \rho^{*}_t\big)\phi^{\pi}(s)\, ds\Big\|$$
$$\le \Big\|\int d^{\pi'}(s) R^{\pi'}(s)\phi^{\pi'}(s)\, ds - \int d^{\pi}(s) R^{\pi}(s)\phi^{\pi}(s)\, ds\Big\| + \Big\|\int d^{\pi'}(s)\rho^{*}_{t+1}\phi^{\pi'}(s)\, ds - \int d^{\pi}(s)\rho^{*}_t\phi^{\pi}(s)\, ds\Big\|$$
$$\le \Big\|\int\big(d^{\pi'}(s) - d^{\pi}(s)\big)R^{\pi'}(s)\phi^{\pi'}(s)\, ds\Big\| + \Big\|\int d^{\pi}(s)\big(R^{\pi'}(s) - R^{\pi}(s)\big)\phi^{\pi'}(s)\, ds\Big\| + \Big\|\int d^{\pi}(s) R^{\pi}(s)\big(\phi^{\pi'}(s) - \phi^{\pi}(s)\big)\, ds\Big\| + \Big\|\int\big(d^{\pi'}(s) - d^{\pi}(s)\big)\rho^{*}_{t+1}\phi^{\pi'}(s)\, ds\Big\| + \Big\|\int d^{\pi}(s)\big(\rho^{*}_{t+1} - \rho^{*}_t\big)\phi^{\pi'}(s)\, ds\Big\| + \Big\|\int d^{\pi}(s)\rho^{*}_t\big(\phi^{\pi'}(s) - \phi^{\pi}(s)\big)\, ds\Big\|$$
$$\le \big[C_r L_d \ (\text{Assumption 5, Lemma 12}) + L_r \ (\text{Assumption 10}) + C_r L_{\phi} \ (\text{Assumptions 5, 8}) + C_r L_d \ (\text{Assumption 5, Lemma 12}) + L_p \ (\text{Lemma 14}) + C_r L_{\phi} \ (\text{Assumptions 5, 8})\big]\,\|\theta_{t+1} - \theta_t\|$$
$$\implies \|b(\theta_t) - b(\theta_{t+1})\| \le (2L_d C_r + 2C_r L_{\phi} + L_r + L_p)\,\|\theta_{t+1} - \theta_t\|. \quad (A.24)$$
Using (A.19), (A.23) and (A.24) in (A.18):
$$\|w^{*}_t - w^{*}_{t+1}\| \le \|A(\theta_t)^{-1}\|\,\|A(\theta_t) - A(\theta_{t+1})\|\,\|A(\theta_{t+1})^{-1}\|\,\|b(\theta_t)\| + \|A(\theta_{t+1})^{-1}\|\,\|b(\theta_t) - b(\theta_{t+1})\| \le (2L_d + 4L_{\phi} + L_t)\,\|A(\theta_t)^{-1}\|\,\|A(\theta_{t+1})^{-1}\|\,\|b(\theta_t)\|\,\|\theta_{t+1} - \theta_t\| + (2L_d C_r + 2C_r L_{\phi} + L_r + L_p)\,\|A(\theta_{t+1})^{-1}\|\,\|\theta_{t+1} - \theta_t\|.$$
Note:
• $\|b(\theta_t)\| = \big\|\int d^{\pi}(s)\big(R^{\pi}(s) - \rho^{*}_t\big)\phi^{\pi}(s)\, ds\big\| \le C_r$ (using Assumption 5);
• From Assumption 12, $\lambda_{min}$ is the lower bound on the magnitude of the eigenvalues of $A(\theta)$ for all $\theta$, so $\|A(\theta)^{-1}\| \le 1/\lambda_{min}$.
$$\therefore \|w^{*}_t - w^{*}_{t+1}\| \le \frac{C_r(2L_d + 4L_{\phi} + L_t)}{\lambda^{2}_{min}}\|\theta_{t+1} - \theta_t\| + \frac{2L_d C_r + 2C_r L_{\phi} + L_r + L_p}{\lambda_{min}}\|\theta_{t+1} - \theta_t\| \le L_w\|\theta_{t+1} - \theta_t\|,$$
where $L_w = \frac{C_r(2L_d + 4L_{\phi} + L_t)}{\lambda^{2}_{min}} + \frac{2L_d C_r + 2C_r L_{\phi} + L_r + L_p}{\lambda_{min}}$.

Lemma 7. $Q^{w}_{diff}$ is the approximate differential Q-value function parameterized by $w$. Then there exists a constant $G_{\theta}$, independent of the policy parameter $\theta$, such that
$$\Big\|\frac{1}{M}\sum_{i=0}^{M-1}\nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s)\Big\| \le G_{\theta}.$$

Proof.
$$\|Q^{w}_{diff}(s, a_1) - Q^{w}_{diff}(s, a_2)\| \le L_a\|a_1 - a_2\| \ (\text{Assumption 6}) \implies \|\nabla_{a} Q^{w}_{diff}(s, a)\| \le L_a \implies \|\nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)}\| \le L_a. \quad (A.25)$$
$$\|\pi(s, \theta_1) - \pi(s, \theta_2)\| \le L_{\pi}\|\theta_1 - \theta_2\| \ (\text{Assumption 7}) \implies \|\nabla_{\theta}\pi(s)\| \le L_{\pi}. \quad (A.26)$$
Using (A.25) and (A.26):
$$\Big\|\frac{1}{M}\sum_{i=0}^{M-1}\nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s)\Big\| \le \frac{1}{M}\sum_{i=0}^{M-1}\big\|\nabla_{a} Q^{w}_{diff}(s, a)\big|_{a=\pi(s)}\nabla_{\theta}\pi(s)\big\| \le L_a L_{\pi} = G_{\theta}.$$

Lemma 8. The average reward estimate $\rho_t$ is bounded: $\forall t > 0$, $|\rho_t| \le C_r + 2C_w$. Here, $C_w$ is the upper bound on the critic parameter $w_t$ (Algorithm 2, step 8) and $C_r$ is the upper bound on rewards (Assumption 5).

Proof. $|\rho_0| \le C_r + 2C_w$ (Assumption 11). For $t = 1$:
$$\rho_1 = \rho_0 + \alpha_0\,\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{0,i}) + \phi^{\pi}(s'_{0,i})^{\top}\tilde{w}_0 - \phi^{\pi}(s_{0,i})^{\top}\tilde{w}_0 - \rho_0\big) = (1 - \alpha_0)\rho_0 + \alpha_0\,\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{0,i}) + \phi^{\pi}(s'_{0,i})^{\top}\tilde{w}_0 - \phi^{\pi}(s_{0,i})^{\top}\tilde{w}_0\big)$$
$$\implies |\rho_1| \le (1 - \alpha_0)|\rho_0| + \alpha_0\,\frac{1}{M}\sum_{i=0}^{M-1}\big(|R^{\pi}(s_{0,i})| + \|\phi^{\pi}(s'_{0,i})\|\,\|\tilde{w}_0\| + \|\phi^{\pi}(s_{0,i})\|\,\|\tilde{w}_0\|\big) \le (1 - \alpha_0)(C_r + 2C_w) + \alpha_0(C_r + 2C_w) = C_r + 2C_w \quad (\text{Assumption 11}).$$
Therefore the bound holds for $t = 1$. Let the bound hold for $t = k$; we show that it also holds for $t = k+1$:
$$\rho_{k+1} = (1 - \alpha_k)\rho_k + \alpha_k\,\frac{1}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_{k,i}) + \phi^{\pi}(s'_{k,i})^{\top}\tilde{w}_k - \phi^{\pi}(s_{k,i})^{\top}\tilde{w}_k\big)$$
$$\implies |\rho_{k+1}| \le (1 - \alpha_k)(C_r + 2C_w) + \alpha_k(C_r + 2C_w) = C_r + 2C_w.$$
Hence, by the principle of mathematical induction, $\forall t > 0$: $|\rho_t| \le C_r + 2C_w$.

Lemma 9.
The norm of the target critic estimator $\tilde{w}_t$ is bounded: $\forall t > 0$, $\|\tilde{w}_t\| \le C_w$. Here, $C_w$ is the upper bound on the critic parameter $w_t$ (Algorithm 2, step 8).

Proof. For $t = 1$:
$$\tilde{w}_1 = (1 - \beta_0)\tilde{w}_0 + \beta_0 w_1 \implies \|\tilde{w}_1\| \le (1 - \beta_0)\|\tilde{w}_0\| + \beta_0\|w_1\| \le (1 - \beta_0)C_w + \beta_0 C_w = C_w \quad (\text{Assumption 11}).$$
The bound holds for $t = 1$. Let the bound hold for $t = k$; then
$$\tilde{w}_{k+1} = (1 - \beta_k)\tilde{w}_k + \beta_k w_{k+1} \implies \|\tilde{w}_{k+1}\| \le (1 - \beta_k)\|\tilde{w}_k\| + \beta_k\|w_{k+1}\| \le (1 - \beta_k)C_w + \beta_k C_w = C_w,$$
so the bound holds for $t = k+1$ as well. Hence, by the principle of mathematical induction, $\forall t > 0$: $\|\tilde{w}_t\| \le C_w$.

Lemma 10. The target average reward estimator $\tilde{\rho}_t$ is bounded: $\forall t > 0$, $|\tilde{\rho}_t| \le C_r + 2C_w$. Here, $C_w$ is the upper bound on the critic parameter $w_t$ (Algorithm 2, step 8) and $C_r$ is the upper bound on rewards (Assumption 5).

Proof. For $t = 1$:
$$\tilde{\rho}_1 = (1 - \beta_0)\tilde{\rho}_0 + \beta_0\rho_1 \implies |\tilde{\rho}_1| \le (1 - \beta_0)|\tilde{\rho}_0| + \beta_0|\rho_1| \le (1 - \beta_0)(C_r + 2C_w) + \beta_0(C_r + 2C_w) = C_r + 2C_w \quad (\text{Assumption 11}).$$
The bound holds for $t = 1$. Let the bound hold for $t = k$; then
$$\tilde{\rho}_{k+1} = (1 - \beta_k)\tilde{\rho}_k + \beta_k\rho_{k+1} \implies |\tilde{\rho}_{k+1}| \le (1 - \beta_k)(C_r + 2C_w) + \beta_k(C_r + 2C_w) = C_r + 2C_w,$$
so the bound holds for $t = k+1$ as well. Hence, by the principle of mathematical induction, $\forall t > 0$: $|\tilde{\rho}_t| \le C_r + 2C_w$.

Lemma 11. The matrix $A(\theta)$ defined below is negative definite for all values of $\theta$ ($\theta$ is the policy parameter):
$$A(\theta) = \int d^{\pi}(s)\Big(\phi^{\pi}(s)\Big(\int P^{\pi}(s'|s)\phi^{\pi}(s')\,ds' - \phi^{\pi}(s)\Big)^{\top} - \eta I\Big)\, ds, \qquad \forall x:\ x^{\top} A(\theta)x \le -\lambda\|x\|^{2},\ \lambda > 0.$$
$\eta$ is the l2-regularisation coefficient from Algorithm 2 and $\eta > \lambda^{all}_{max}$, where $\lambda^{all}_{max}$ is defined in the proof below.

Proof. Let
$$A'(\theta) = \int d^{\pi}(s)\,\phi^{\pi}(s)\Big(\int P^{\pi}(s'|s)\phi^{\pi}(s')\,ds' - \phi^{\pi}(s)\Big)^{\top}\, ds = A(\theta) + \eta I. \quad (A.27)$$
Here, $\eta$ is the l2-regularization coefficient from Algorithm 2.
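The spectral step used in this proof (a quadratic form only sees the symmetric part of a matrix, so subtracting $\eta I$ with $\eta$ above its largest eigenvalue forces negative definiteness) is easy to verify numerically. A toy sketch, with an arbitrary random matrix standing in for $A'(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
A_prime = rng.normal(size=(d, d))            # arbitrary stand-in for A'(theta)
# quadratic forms only see the symmetric part (A' + A'^T)/2
lam_max = np.linalg.eigvalsh((A_prime + A_prime.T) / 2).max()

eta = lam_max + 0.5                           # choose eta > lambda_max
lam = eta - lam_max                           # the lambda of the lemma, here 0.5
A = A_prime - eta * np.eye(d)                 # A = A' - eta I

for _ in range(100):
    x = rng.normal(size=d)
    assert x @ A @ x <= -lam * (x @ x) + 1e-9
```

The same check applies verbatim to the off-policy matrix $A^{\mu}_{off}(\theta)$ of Lemma 16, with $\chi^{all}_{max}$ in place of $\lambda^{all}_{max}$.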
$$x^{\top} A'(\theta)x = x^{\top}\,\frac{A'(\theta)^{\top} + A'(\theta)}{2}\, x \le \lambda_{max}(\theta)\,\|x\|^{2},$$
where $\frac{A'(\theta)^{\top} + A'(\theta)}{2}$ is a symmetric matrix and $\lambda_{max}(\theta)$ is its maximum eigenvalue. Using $\lambda^{all}_{max}$ from Assumption 13:
$$x^{\top} A'(\theta)x \le \lambda^{all}_{max}\|x\|^{2} \implies x^{\top}\big(A'(\theta) - \eta I\big)x \le (\lambda^{all}_{max} - \eta)\|x\|^{2} \implies x^{\top} A(\theta)x \le (\lambda^{all}_{max} - \eta)\|x\|^{2} \quad (\text{using (A.27)}).$$
Here, if we take $\eta > \lambda^{all}_{max}$, then we can set $\lambda = \eta - \lambda^{all}_{max}$:
$$\implies \forall x:\ x^{\top} A(\theta)x \le -\lambda\|x\|^{2}, \quad \lambda > 0.$$

Lemma 12. Let $\theta_1$ and $\theta_2$ be the policy parameters of $\pi'$ and $\pi$ respectively, and let $d^{\pi'}(\cdot)$ and $d^{\pi}(\cdot)$ be the stationary state distributions of $\pi'$ and $\pi$. With $D_{TV}$ denoting the total variation distance between two probability distributions, we have:
$$\int|d^{\pi'}(s) - d^{\pi}(s)|\, ds = 2D_{TV}(d^{\pi'}, d^{\pi}) \le L_d\|\theta_1 - \theta_2\|.$$
Here, $L_d = 2^{m+1}(\lceil\log_{\kappa} a^{-1}\rceil + 1/\kappa)L_t$; $L_t$ is the Lipschitz constant of the transition probability density function (Assumption 9); the constants $a$ and $\kappa$ are from Assumption 2; $m$ is the dimension of the state space.

Proof. Let $\mu_1$ and $\mu_2$ be the stationary state probability measures of $\pi'$ and $\pi$ respectively, so that $d\mu_1 = d^{\pi'}(s)\, ds$, $d\mu_2 = d^{\pi}(s)\, ds$, and
$$\int|d^{\pi'}(s) - d^{\pi}(s)|\, ds = 2D_{TV}(d^{\pi'}, d^{\pi}) = 2D_{TV}(\mu_1, \mu_2).$$
Using the result of Theorem 3.1 of Mitrophanov (2005):
$$2D_{TV}(\mu_1, \mu_2) \le 2\Big(\lceil\log_{\kappa} a^{-1}\rceil + \frac{1}{\kappa}\Big)\|K_1 - K_2\|, \quad (A.28)$$
where $K_1$ and $K_2$ are the probability transition kernels of the Markov chains induced by the policies $\pi'$ and $\pi$.
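The qualitative content of the Mitrophanov perturbation bound (nearby transition kernels have nearby stationary distributions in total variation) can be illustrated numerically. The chain and the perturbation below are toy stand-ins of our own, not objects from the paper:

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an ergodic transition matrix P."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
    return v / v.sum()

rng = np.random.default_rng(4)
S = 4
P1 = rng.uniform(size=(S, S)); P1 /= P1.sum(axis=1, keepdims=True)
E = rng.uniform(size=(S, S)); E /= E.sum(axis=1, keepdims=True)

tvs = []
for eps in (0.1, 0.01, 0.001):
    P2 = (1 - eps) * P1 + eps * E            # kernel perturbation of size O(eps)
    tv = 0.5 * np.abs(stationary(P1) - stationary(P2)).sum()
    tvs.append(tv)
    print(f"eps={eps}: TV(d1, d2) = {tv:.2e}")
```

As the kernel gap shrinks, the total variation distance between the two stationary distributions shrinks proportionally, which is exactly the linear dependence on $\|K_1 - K_2\|$ that (A.28) encodes.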
From equation A.28:
$$\|K_1-K_2\| \le \sup_{\|g\|_{TV}=1}\Big\|\int g(ds)\big(K_1(\cdot|s)-K_2(\cdot|s)\big)\Big\|_{TV}$$
$$\Big\|\int g(ds)\big(K_1(\cdot|s)-K_2(\cdot|s)\big)\Big\|_{TV} \le \sup_{|f|\le 1}\Big|\int\!\!\int f(s')(K_1-K_2)(ds'|s)\,g(ds)\Big|$$
$$\le \sup_{|f|\le 1}\Big|\int\!\!\int f(s')\big(P^{\pi'}(s'|s)-P^{\pi}(s'|s)\big)g(ds)\,ds'\Big|$$
$$\le \sup_{|f|\le 1}|f(s')|\int\!\!\int\big|P^{\pi'}(s'|s)-P^{\pi}(s'|s)\big|\,g(ds)\,ds'$$
$$\le L_t\|\theta_1-\theta_2\|\int g(ds)\int ds' \le 2^m L_t\|\theta_1-\theta_2\|$$
$$\implies \|K_1-K_2\| \le 2^m L_t\|\theta_1-\theta_2\| \quad (A.29)$$
From equations A.28 and A.29:
$$\int|d^{\pi'}(s)-d^{\pi}(s)|\,ds = 2D_{TV}(d^{\pi'},d^{\pi}) \le 2^{m+1}\Big(\lceil\log_{\kappa}a^{-1}\rceil+\frac{1}{\kappa}\Big)L_t\|\theta_1-\theta_2\| = L_d\|\theta_1-\theta_2\|.$$

Proof of Lemma 13.
$$\nabla_{\theta}\rho(\pi) = \int_S d^{\pi}(s)\nabla_a Q^{\pi}_{diff}(s,a)|_{a=\pi(s)}\nabla_{\theta}\pi(s,\theta)\,ds = \int_S d^{\pi}(s)\nabla_a Q^{w}_{diff}(s,a)|_{a=\pi(s)}\nabla_{\theta}\pi(s,\theta)\,ds$$
$$= \int_S d^{\pi}(s)\nabla_{\theta}\pi(s,\theta)\nabla_{\theta}\pi(s,\theta)^{\top}w^*_{\epsilon}\,ds = \mathbb{E}\big[\nabla_{\theta}\pi(s,\theta)\nabla_{\theta}\pi(s,\theta)^{\top}\big]w^*_{\epsilon}$$
Here, $H_{\theta} = \mathbb{E}[\nabla_{\theta}\pi(s,\theta)\nabla_{\theta}\pi(s,\theta)^{\top}]$, so $\nabla_{\theta}\rho(\pi) = H_{\theta}w^*_{\epsilon}$ and
$$w^*_{\epsilon} = H_{\theta}^{-1}\nabla_{\theta}\rho(\pi) \implies \|w^*_{\epsilon}\| \le \|H_{\theta}^{-1}\|\,\|\nabla_{\theta}\rho(\pi)\|.$$
By Assumption 14, the minimum eigenvalue of $H_{\theta}$ is lower bounded by $\lambda^{\epsilon}_{min}$ for all $\theta$; using Assumptions 6 and 7:
$$\|w^*_{\epsilon}\| \le \frac{L_a L_{\pi}}{\lambda^{\epsilon}_{min}} = C_{w^*_{\epsilon}}.$$

Lemma 14. The average reward performance metric $\rho(\pi)$ (equivalently $\rho(\theta)$), defined in (3), is Lipschitz continuous with respect to the policy (actor) parameter $\theta$:
$$\|\rho(\theta_1)-\rho(\theta_2)\| \le L_p\|\theta_1-\theta_2\|.$$
Proof. Let $\theta_1$ and $\theta_2$ be the policy parameters of the policies $\pi'$ and $\pi$.
$$\|\rho(\theta_1)-\rho(\theta_2)\| = \|\rho(\pi')-\rho(\pi)\| = \Big\|\int_S d^{\pi'}(s)R^{\pi'}(s)\,ds - \int_S d^{\pi}(s)R^{\pi}(s)\,ds\Big\|$$
$$\le \Big\|\int_S \big(d^{\pi'}(s)-d^{\pi}(s)\big)R^{\pi'}(s)\,ds\Big\| + \Big\|\int_S d^{\pi}(s)\big(R^{\pi'}(s)-R^{\pi}(s)\big)\,ds\Big\|$$
$$\le L_d\|\theta_1-\theta_2\|\ \text{(Lemma 12)} + L_r\|\theta_1-\theta_2\|\ \text{(Assumption 10)} = (L_d+L_r)\|\theta_1-\theta_2\| = L_p\|\theta_1-\theta_2\|$$
where $L_p = L_d + L_r$.

Lemma 15. The optimal critic parameter $w(\theta_t)^*$, as a function of the actor parameter $\theta_t$, is Lipschitz continuous with constant $L_v$ in the off-policy case. Note: $w^*_t = w(\theta_t)^*$; $\mu$ is the behaviour policy.
$$\|w^*_t-w^*_{t+1}\| \le L_v\|\theta_{t+1}-\theta_t\|$$
First, we bound $\|b(\theta_t)-b(\theta_{t+1})\|$:
$$\|b(\theta_t)-b(\theta_{t+1})\| = \Big\|\int d^{\mu}(s)\big(R^{\mu}(s)-\rho^*_{t+1}\big)\phi^{\pi'}(s)\,ds - \int d^{\mu}(s)\big(R^{\mu}(s)-\rho^*_t\big)\phi^{\pi}(s)\,ds\Big\|$$
$$\le \Big\|\int d^{\mu}(s)\big(R^{\mu}(s)-R^{\mu}(s)\big)\phi^{\pi'}(s)\,ds\Big\| + \Big\|\int d^{\mu}(s)R^{\mu}(s)\big(\phi^{\pi'}(s)-\phi^{\pi}(s)\big)\,ds\Big\|$$
$$\quad + \Big\|\int d^{\mu}(s)\big(\rho^*_{t+1}-\rho^*_t\big)\phi^{\pi'}(s)\,ds\Big\| + \Big\|\int d^{\mu}(s)\rho^*_t\big(\phi^{\pi'}(s)-\phi^{\pi}(s)\big)\,ds\Big\|$$
$$\le C_r L_{\phi}\|\theta_{t+1}-\theta_t\|\ \text{(Assumption 5)} + L_p\|\theta_{t+1}-\theta_t\|\ \text{(Lemma 14)} + C_r L_{\phi}\|\theta_{t+1}-\theta_t\|\ \text{(Assumption 5)}$$
$$\implies \|b(\theta_t)-b(\theta_{t+1})\| \le (2C_r L_{\phi}+L_p)\|\theta_{t+1}-\theta_t\|$$
Then:
$$\|w^*_t-w^*_{t+1}\| \le \|A(\theta_t)^{-1}-A(\theta_{t+1})^{-1}\|\,\|b(\theta_t)\| + \|A(\theta_{t+1})^{-1}\|\,\|b(\theta_t)-b(\theta_{t+1})\|$$
$$\le \|A(\theta_t)^{-1}\|\,\|A(\theta_t)-A(\theta_{t+1})\|\,\|A(\theta_{t+1})^{-1}\|\,\|b(\theta_t)\| + \|A(\theta_{t+1})^{-1}\|\,\|b(\theta_t)-b(\theta_{t+1})\|$$
$$\le 4L_{\phi}\|A(\theta_t)^{-1}\|\,\|A(\theta_{t+1})^{-1}\|\,\|b(\theta_t)\|\,\|\theta_{t+1}-\theta_t\| + (2C_rL_{\phi}+L_p)\|A(\theta_{t+1})^{-1}\|\,\|\theta_{t+1}-\theta_t\|$$
Note:
- $\|b(\theta_t)\| = \big\|\int d^{\mu}(s)\big(R^{\mu}(s)-\rho^*_t\big)\phi^{\pi}(s)\,ds\big\| \le C_r$ (Assumption 5).
- Let $\lambda_{min}$ be the lower bound on the absolute values of the eigenvalues of $A(\theta)$ for all $\theta$, so that $\|A(\theta)^{-1}\| \le 1/\lambda_{min}$.
$$\therefore\ \|w^*_t-w^*_{t+1}\| \le \frac{4C_rL_{\phi}}{\lambda^2_{min}}\|\theta_{t+1}-\theta_t\| + \frac{2C_rL_{\phi}+L_p}{\lambda_{min}}\|\theta_{t+1}-\theta_t\| \le L_v\|\theta_{t+1}-\theta_t\|$$
where $L_v = \frac{4C_rL_{\phi}}{\lambda^2_{min}} + \frac{2C_rL_{\phi}+L_p}{\lambda_{min}}$.

Lemma 16. The matrix $A^{\mu}_{off}(\theta)$ defined below is negative definite for all values of $\theta$ ($\theta$ is the policy parameter; $\theta_{\mu}$ is the policy parameter of the behaviour policy $\mu$):
$$A^{\mu}_{off}(\theta) := \int d^{\mu}(s)\Big(\phi^{\pi}(s)\Big(\int P^{\pi}(s'|s)\phi^{\pi}(s')\,ds'-\phi^{\pi}(s)\Big)^{\top}-\eta I\Big)\,ds$$
$$\forall x\quad x^{\top}A^{\mu}_{off}(\theta)x \le -\lambda\|x\|^2,\qquad \lambda>0.$$
Here $\eta$ is the l2-regularisation coefficient from Algorithm 3 and $\eta > \chi^{all}_{max}$, where $\chi^{all}_{max}$ is defined in the proof below.

Proof. Let
$$A^{\mu\prime}_{off}(\theta) = \int d^{\mu}(s)\,\phi^{\pi}(s)\Big(\int P^{\pi}(s'|s)\phi^{\pi}(s')\,ds'-\phi^{\pi}(s)\Big)^{\top}\,ds = A^{\mu}_{off}(\theta)+\eta I. \quad (A.37)$$
Here, $\eta$ is the l2-regularisation coefficient from Algorithm 3.
$$x^{\top}A^{\mu\prime}_{off}(\theta)x = x^{\top}\frac{A^{\mu\prime}_{off}(\theta)^{\top}+A^{\mu\prime}_{off}(\theta)}{2}x \le \chi_{max}(\theta)\|x\|^2$$
Here, $\frac{A^{\mu\prime}_{off}(\theta)^{\top}+A^{\mu\prime}_{off}(\theta)}{2}$ is a symmetric matrix and $\chi_{max}(\theta)$ is its maximum eigenvalue. Using $\chi^{all}_{max}$ from Assumption 15:
$$x^{\top}A^{\mu\prime}_{off}(\theta)x \le \chi^{all}_{max}\|x\|^2$$
$$x^{\top}\big(A^{\mu\prime}_{off}(\theta)-\eta I\big)x \le (\chi^{all}_{max}-\eta)\|x\|^2$$
$$x^{\top}A^{\mu}_{off}(\theta)x \le (\chi^{all}_{max}-\eta)\|x\|^2 \quad \text{(using A.37)}$$
If we take $\eta > \chi^{all}_{max}$, then we can set $\lambda = \eta - \chi^{all}_{max}$, so that $\forall x\ x^{\top}A^{\mu}_{off}(\theta)x \le -\lambda\|x\|^2$, $\lambda>0$.

Lemma 17. Let the cumulative error of the off-policy actor be $\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\nabla}_{\theta}\rho(\theta_t)\|^2$ and the cumulative error of the critic be $\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^2$, where $\theta_t$ and $w_t$ are the actor and linear critic parameters at time $t$, and $\theta_{\mu}$ is the policy parameter of the behaviour policy $\mu$. The cumulative error of the off-policy actor is bounded via the cumulative error of the critic as:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\nabla}_{\theta}\rho(\theta_t)\|^2 \le \frac{4C_r}{C_{\gamma}T^{1-v}} + 6C^4_{\pi}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^2\Big) + 6C^4_{\pi}\Big(\tau^2+\frac{4}{M}C^2_{w^*_{\epsilon}}\Big) + \frac{2C_{\gamma}L_JG^2_{\theta}}{1-v}T^{-v} + \frac{Z}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\theta_{\mu}-\theta_t\|^2$$
Here, $C_r$ is the upper bound on rewards (Assumption 5); $C_{\gamma}$ and $v$ are the constants of the step size $\gamma_t$ (Assumption 3); $\|\nabla_{\theta}\pi(s)\| \le C_{\pi}$ (Assumption 7); $\Delta w_t = w_t - w^*_t$; $\tau = \max_t\|w^*_t-w^*_{\epsilon,t}\|$, where $w^*_{\epsilon}$ is the optimal critic parameter according to Lemma 2 and $w^*_t$ is the optimal parameter given by the TD(0) algorithm for policy parameter $\theta_t$. The constant $C_{w^*_{\epsilon}}$ is defined in Lemma 13; $L_J$ is the coefficient in the smoothness condition of the non-convex function $\rho(\theta)$; the constant $G_{\theta}$ is defined in Lemma 7; $M$ is the size of the batch of samples used to update the parameters; $Z = 2^{m+1}C(\lceil\log_{\kappa}a^{-1}\rceil+1/\kappa)L_t$, with $L_t$ the Lipschitz constant of the transition probability density function (Assumption 9), $a$ and $\kappa$ the constants from Assumption 2, $m$ the dimension of the state space, and $C = \max_s\|\nabla_aQ^{\pi}_{diff}(s,a)|_{a=\pi(s)}\nabla_{\theta}\pi(s,\theta)\|$.

Proof.
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\nabla}_{\theta}\rho(\theta_t)\|^2 = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)+\hat{\nabla}_{\theta}\rho(\theta_t)-\nabla_{\theta}\rho(\theta_t)\|^2$$
$$\le \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla_{\theta}\rho(\theta_t)\|^2 + \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\nabla}_{\theta}\rho(\theta_t)-\nabla_{\theta}\rho(\theta_t)\|^2$$
Using Theorem 2 and Lemma 3:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\nabla}_{\theta}\rho(\theta_t)\|^2 \le \frac{4C_r}{C_{\gamma}T^{1-v}} + 6C^4_{\pi}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^2\Big) + 6C^4_{\pi}\Big(\tau^2+\frac{4}{M}C^2_{w^*_{\epsilon}}\Big) + \frac{2C_{\gamma}L_JG^2_{\theta}}{1-v}T^{-v} + \frac{Z}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\theta_{\mu}-\theta_t\|^2$$

Theorem 4. The off-policy average reward actor-critic algorithm (Algorithm 3) with behaviour policy $\mu$ obtains an $\epsilon$-accurate optimal point with a sample complexity of $\Omega(\epsilon^{-2.5})$. Here $\theta_{\mu}$ refers to the behaviour policy parameter and $\theta_t$ to the target (current) policy parameter. We obtain
$$\min_{0\le t\le T-1}\mathbb{E}\|\hat{\nabla}_{\theta}\rho(\theta_t)\|^2 = O\Big(\frac{1}{T^{0.4}}\Big)+O(1)+O(W^2_{\theta}) \le \epsilon+O(1)+O(W^2_{\theta})$$
where $W_{\theta} := \max_t\|\theta_{\mu}-\theta_t\|$.

Proof. Lemmas 4 and 5 hold in the off-policy case as well; Lemma 4 requires Lemma 15 in place of Lemma 6. Using Lemmas 4 and 5, and following the procedure of Theorem 3 to obtain the asymptotic rates, we have:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta w_t\|^2 = O\Big(\frac{1}{T^{1-\sigma}}\Big)+O\Big(\frac{1}{T^{\sigma}}\Big)+O\Big(\frac{1}{T^{2(v-\sigma)}}\Big)+O(1) \quad (A.38)$$
Using Lemma 17 and equation A.38:
$$\min_{0\le t\le T-1}\mathbb{E}\|\hat{\nabla}_{\theta}\rho(\theta_t)\|^2 = O\Big(\frac{1}{T^{1-v}}\Big)+O\Big(\frac{1}{T^{v}}\Big)+O\Big(\frac{1}{T^{1-\sigma}}\Big)+O\Big(\frac{1}{T^{\sigma}}\Big)+O\Big(\frac{1}{T^{2(v-\sigma)}}\Big)+O(1)+\frac{Z}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\theta_{\mu}-\theta_t\|^2$$
By setting $v=3/5$ and $\sigma=2/5$, we obtain:
$$\min_{0\le t\le T-1}\mathbb{E}\|\hat{\nabla}_{\theta}\rho(\theta_t)\|^2 = O\Big(\frac{1}{T^{0.4}}\Big)+O(1)+\frac{Z}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\theta_{\mu}-\theta_t\|^2 = O\Big(\frac{1}{T^{0.4}}\Big)+O(1)+ZW^2_{\theta} = O\Big(\frac{1}{T^{0.4}}\Big)+O(1)+O(W^2_{\theta}).$$
Requiring $O(1/T^{0.4}) \le \epsilon$ gives $T = \Omega(\epsilon^{-2.5})$. Hence, the sample complexity of the off-policy average reward actor-critic algorithm is $\Omega(\epsilon^{-2.5})$.
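The choice $v=3/5$, $\sigma=2/5$ in Theorem 4 balances the competing exponents: the bound decays as $T^{-r}$ with $r = \min(1-v,\,v,\,1-\sigma,\,\sigma,\,2(v-\sigma))$, and this minimum is maximized at $r=0.4$. A brute-force grid search sketching this arithmetic only (the function name and grid are illustrative, not from the paper):

```python
def best_rate():
    """Grid-search the step-size exponents v (actor) and sigma (critic).

    The bound in Theorem 4 decays as T^{-r} with
    r = min(1 - v, v, 1 - sigma, sigma, 2 * (v - sigma)).
    Returns the best achievable r and the maximizing (v, sigma).
    """
    best = (0.0, None, None)
    for i in range(1, 100):
        for j in range(1, 100):
            v, s = i / 100, j / 100
            r = min(1 - v, v, 1 - s, s, 2 * (v - s))
            if r > best[0]:
                best = (r, v, s)
    return best
```

The grid search recovers $v=0.6$, $\sigma=0.4$ with rate exponent $0.4$, matching the $O(1/T^{0.4})$ term and hence the $\Omega(\epsilon^{-2.5})$ sample complexity.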

A.2 BOUNDEDNESS OF CRITIC PARAMETER

In this section we prove that the critic parameter $w$ used in Algorithms 2 and 3 is bounded even without the projection operator $\Gamma_{C_w}$, defined as $\Gamma_{C_w}: \mathbb{R}^k \to B$, where $B(\subset\mathbb{R}^k)$ is a compact convex set. Let the policy $\pi$ be parameterized by $\theta$. For simplicity of the proof, we assume the batch size $M$ to be 1. The critic parameter $w_t \in \mathbb{R}^k$, $\phi^{\pi}(s)\in\mathbb{R}^k$, and $\rho_t$ is a scalar. Let the update rules for the critic parameter and the average reward estimator be:
$$w_{t+1} = w_t + \alpha_t\big(R^{\pi}(s_t)-\bar{\rho}_t+\phi^{\pi}(s'_t)^{\top}\bar{w}_t-\phi^{\pi}(s_t)^{\top}w_t\big)\phi^{\pi}(s_t) - \alpha_t\eta w_t$$
$$\rho_{t+1} = \rho_t + \alpha_t\big(R^{\pi}(s_t)-\rho_t+\phi^{\pi}(s'_t)^{\top}\bar{w}_t-\phi^{\pi}(s_t)^{\top}\bar{w}_t\big)$$
$$\bar{w}_{t+1} = \bar{w}_t + \beta_t(w_{t+1}-\bar{w}_t)$$
$$\bar{\rho}_{t+1} = \bar{\rho}_t + \beta_t(\rho_{t+1}-\bar{\rho}_t) \quad (A.39)$$
Let us define $z_t := [w_t\ \rho_t]^{\top}$ and $\bar{z}_t := [\bar{w}_t\ \bar{\rho}_t]^{\top}$. Here $0$ is the zero vector in $\mathbb{R}^k$ and $I_0$ is the identity matrix in $\mathbb{R}^{(k+1)\times(k+1)}$ with $I_0[k][k]=0$ (assuming indexing starts from 0). We will use the extension of the stability criterion for iterates given by Borkar & Meyn (2000) to two-timescale stochastic approximation schemes (Lakshminarayanan & Bhatnagar, 2017), with
$$M^2_{t+1}=0,\qquad \lambda(z_t)=(\bar{B}_{\phi}+\eta I_0)^{-1}\big(\bar{R}^{\pi}_{\phi}+\bar{A}_{\phi}\bar{z}_t\big),\qquad \epsilon(n)=z_{t+1}-\lambda(z_t).$$
$\lambda(z_t)$ is the unique globally asymptotically stable equilibrium point of the ODE $\dot{z}=h(z(t),\bar{z})$. ($\lambda$ used here has no relation to the usage of $\lambda$ in any other section of the paper.) Using Lemma 1 of Chapter 6 of Borkar (2009), we have $\|z_{t+1}-\lambda(z_t)\|\to 0$; hence $\epsilon(n)=o(1)$. Therefore we can use the conclusion of Lakshminarayanan & Bhatnagar (2017). We will now verify conditions A1 to A5 of Lakshminarayanan & Bhatnagar (2017) to prove the boundedness of the critic parameter. Let $C_{\phi} = \int_S d^{\pi}(s_t)\phi^{\pi}(s_t)\phi^{\pi}(s_t)^{\top}\,ds_t$.
$$\bar{B}_{\phi}+\eta I_0 = \begin{bmatrix} C_{\phi}+\eta I & 0 \\ 0^{\top} & 1 \end{bmatrix}$$
$$[w^{\top}\ \rho]\begin{bmatrix} C_{\phi}+\eta I & 0 \\ 0^{\top} & 1 \end{bmatrix}\begin{bmatrix} w \\ \rho \end{bmatrix} = w^{\top}(C_{\phi}+\eta I)w+\rho^2$$
If $\eta$ is strictly greater than the negative of the minimum eigenvalue of $C_{\phi}$, then
$$\forall\begin{bmatrix} w \\ \rho \end{bmatrix}\neq 0:\qquad [w^{\top}\ \rho]\,\big(\bar{B}_{\phi}+\eta I_0\big)\begin{bmatrix} w \\ \rho \end{bmatrix} > 0. \quad (A.44)$$
Hence, for $\eta+\lambda_{min}(C_{\phi})>0$, $\bar{B}_{\phi}+\eta I_0$ is a positive definite matrix. Therefore, the ODE $\dot{z}(t) := h_{\infty}(z(t),\bar{z})$ has a unique globally asymptotically stable equilibrium point $\lambda_{\infty}(\bar{z})$ with $\lambda_{\infty}(0)=0$. Condition A4 is satisfied.
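The block structure above makes the positive-definiteness check in condition A4 concrete: the quadratic form splits into $w^{\top}(C_{\phi}+\eta I)w + \rho^2$, so it suffices that $\eta+\lambda_{min}(C_{\phi})>0$. A small numerical check with a synthetic feature matrix (dimensions, seed, and the margin $0.1$ are arbitrary illustrative choices):

```python
import numpy as np

def block_pd_check(seed=0):
    """Check that [[C_phi + eta*I, 0], [0^T, 1]] is positive definite
    when eta + lambda_min(C_phi) > 0, as used in condition A4."""
    rng = np.random.default_rng(seed)
    Phi = rng.normal(size=(50, 3))          # synthetic feature samples phi(s)
    C = Phi.T @ Phi / 50.0                  # stand-in for C_phi (symmetric PSD)
    lam_min = np.linalg.eigvalsh(C).min()
    eta = max(0.0, -lam_min) + 0.1          # ensures eta + lam_min(C_phi) > 0
    B = np.zeros((4, 4))                    # stand-in for B_phi + eta*I_0
    B[:3, :3] = C + eta * np.eye(3)
    B[3, 3] = 1.0
    return np.linalg.eigvalsh(B).min() > 0  # smallest eigenvalue positive?
```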

Condition A5:

Algorithm 2 On-policy AR-DPG with Linear FA
Initialize actor parameter $\theta$ and critic parameter $w$. Initialize actor target parameter $\bar{\theta} \leftarrow \theta$ and critic target parameter $\bar{w} \leftarrow w$. Initialize average reward parameter $\rho$ and target average reward parameter $\bar{\rho} \leftarrow \rho$. Initialize Buffer = {}.
1: $t=0$, $s_0$ = env.reset()
2: while $t \le$ total steps do
3:   $a_t = \pi(s_t)+\epsilon$ {$\epsilon$ is the noise}
4:   $s_{t+1} \sim P(\cdot|s_t,a_t)$ and $r_t = R(s_t,a_t)$
5:   Store $\{s_t,a_t,s_{t+1}\}$ in the Buffer
6:   Sample $B_t = \{s_i,a_i,s'_i\}_{i=0}^{M-1}$ from the Buffer
7:   $w_{t+1} = \Gamma_{C_w}\big(w_t+\frac{\alpha_t}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_i)-\bar{\rho}_t+\phi^{\pi}(s'_i)^{\top}\bar{w}_t-\phi^{\pi}(s_i)^{\top}w_t\big)\phi^{\pi}(s_i)-\alpha_t\eta w_t\big)$
8:   $\rho_{t+1} = \rho_t+\frac{\alpha_t}{M}\sum_{i=0}^{M-1}\big(R^{\pi}(s_i)-\rho_t+\phi^{\pi}(s'_i)^{\top}\bar{w}_t-\phi^{\pi}(s_i)^{\top}\bar{w}_t\big)$
9:   $\bar{w}_{t+1} = \bar{w}_t+\beta_t(w_{t+1}-\bar{w}_t)$
10:  $\bar{\rho}_{t+1} = \bar{\rho}_t+\beta_t(\rho_{t+1}-\bar{\rho}_t)$
11:  $\theta_{t+1} = \theta_t+\frac{\gamma_t}{M}\sum_{i=0}^{M-1}\nabla_aQ^w_{diff}(s_i,a)|_{a=\pi(s_i)}\nabla_{\theta}\pi(s_i)$
12:  $t = t+1$
13: end while



Figure 1: Comparison of performance of different average reward algorithms

Condition A1:
$$\|h(z_1,\bar{z}_1)-h(z_2,\bar{z}_2)\| = \|\bar{A}_{\phi}(\bar{z}_1-\bar{z}_2)-(\bar{B}_{\phi}+\eta I_0)(z_1-z_2)\|$$
$$\le \|\bar{A}_{\phi}\|\,\|\bar{z}_1-\bar{z}_2\| + \|\bar{B}_{\phi}+\eta I_0\|\,\|z_1-z_2\|$$
$$\le \max\big(\|\bar{A}_{\phi}\|,\|\bar{B}_{\phi}+\eta I_0\|\big)\big(\|z_1-z_2\|+\|\bar{z}_1-\bar{z}_2\|\big) = L_h\big(\|z_1-z_2\|+\|\bar{z}_1-\bar{z}_2\|\big) \quad (A.42)$$
where $L_h = \max(\|\bar{A}_{\phi}\|,\|\bar{B}_{\phi}+\eta I_0\|)$. Therefore, $h(z,\bar{z})$ is Lipschitz continuous with constant $L_h$.
$$\|g(z_1,\bar{z}_1)-g(z_2,\bar{z}_2)\| = \big\|\big((\bar{B}_{\phi}+\eta I_0)^{-1}\bar{A}_{\phi}-I\big)(\bar{z}_1-\bar{z}_2)\big\| \le \big\|(\bar{B}_{\phi}+\eta I_0)^{-1}\bar{A}_{\phi}-I\big\|\,\|\bar{z}_1-\bar{z}_2\| = L_g\|\bar{z}_1-\bar{z}_2\| \quad (A.43)$$
where $L_g = \|(\bar{B}_{\phi}+\eta I_0)^{-1}\bar{A}_{\phi}-I\|$.

$$h_c(z,\bar{z}) := \frac{h(cz,c\bar{z})}{c} = \frac{\bar{R}^{\pi}_{\phi}+c\bar{A}_{\phi}\bar{z}-c(\bar{B}_{\phi}+\eta I_0)z}{c}$$
$$\lim_{c\to\infty}h_c(z,\bar{z}) = \lim_{c\to\infty}\frac{\bar{R}^{\pi}_{\phi}+c\bar{A}_{\phi}\bar{z}-c(\bar{B}_{\phi}+\eta I_0)z}{c} = \bar{A}_{\phi}\bar{z}-(\bar{B}_{\phi}+\eta I_0)z$$
Let us define $h_{\infty}(z,\bar{z}) := \bar{A}_{\phi}\bar{z}-(\bar{B}_{\phi}+\eta I_0)z$. The ODE $\dot{z}(t) := h_{\infty}(z(t),\bar{z})$ has a unique globally asymptotically stable equilibrium point $\lambda_{\infty}(\bar{z}) = (\bar{B}_{\phi}+\eta I_0)^{-1}\bar{A}_{\phi}\bar{z}$ if $(\bar{B}_{\phi}+\eta I_0)$ is a positive definite matrix.

The state feature mapping $\phi^{\pi}(s) = \phi(s,\pi(s))$, defined for a policy $\pi$ with parameter $\theta$, is Lipschitz continuous w.r.t. $\theta$. Thus, $\max_s\|\phi^{\pi_1}(s)-\phi^{\pi_2}(s)\| \le L_{\phi}\|\theta_1-\theta_2\|$.
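This Lipschitz property is easy to visualize with a toy construction: take $\phi(s,a) = [s,\ a]$ and a deterministic linear policy $a=\theta^{\top}s$ on the unit ball of states, which gives $L_{\phi}=1$. The feature map, policy, and constants below are hypothetical stand-ins chosen only to illustrate the assumption, not the paper's features:

```python
import numpy as np

def feature_lipschitz_demo(seed=0):
    """Toy feature map phi(s, a) = [s, a] with a deterministic linear policy
    a = theta^T s. Then max_s ||phi^{pi_1}(s) - phi^{pi_2}(s)|| is at most
    (max_s ||s||) * ||theta_1 - theta_2||, i.e. L_phi = 1 on the unit ball."""
    rng = np.random.default_rng(seed)
    theta1, theta2 = rng.normal(size=3), rng.normal(size=3)
    worst = 0.0
    for _ in range(1000):
        s = rng.normal(size=3)
        s /= max(np.linalg.norm(s), 1.0)          # keep ||s|| <= 1
        phi1 = np.concatenate([s, [theta1 @ s]])  # phi^{pi_theta1}(s)
        phi2 = np.concatenate([s, [theta2 @ s]])
        worst = max(worst, np.linalg.norm(phi1 - phi2))
    return worst, np.linalg.norm(theta1 - theta2)  # observed gap, L_phi * ||dtheta||
```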


Lemma 13. The optimal critic parameter $w^*_{\epsilon}$ given by the compatible function approximation lemma (Lemma 2) is bounded by the constant $C_{w^*_{\epsilon}}$.
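The bound in Lemma 13 rests on the standard inequality $\|H^{-1}x\| \le \|x\|/\lambda_{min}(H)$ for a symmetric positive definite $H$. A minimal numerical check with a synthetic stand-in for $H_{\theta}$ (matrix, dimension, and seed are illustrative):

```python
import numpy as np

def eig_bound_check(seed=0):
    """Check ||H^{-1} x|| <= ||x|| / lambda_min(H) for symmetric positive
    definite H, the inequality behind the bound on w*_eps in Lemma 13."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(4, 4))
    H = G @ G.T + 0.5 * np.eye(4)          # synthetic stand-in for H_theta
    lam_min = np.linalg.eigvalsh(H).min()  # lower bound on the spectrum
    x = rng.normal(size=4)
    w = np.linalg.solve(H, x)              # w = H^{-1} x
    return np.linalg.norm(w) <= np.linalg.norm(x) / lam_min + 1e-9
```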

Let
$$R^{\pi}_{\phi}(s_t) = \begin{bmatrix} R^{\pi}(s_t)\phi^{\pi}(s_t) \\ R^{\pi}(s_t) \end{bmatrix},\qquad A_{\phi}(s_t,s'_t) = \begin{bmatrix} \phi^{\pi}(s_t)\phi^{\pi}(s'_t)^{\top} & -\phi^{\pi}(s_t) \\ \big(\phi^{\pi}(s'_t)-\phi^{\pi}(s_t)\big)^{\top} & 0 \end{bmatrix},\qquad B_{\phi}(s_t) = \begin{bmatrix} \phi^{\pi}(s_t)\phi^{\pi}(s_t)^{\top} & 0 \\ 0^{\top} & 1 \end{bmatrix}$$
so that the updates A.39 can be written jointly as
$$z_{t+1} = z_t + \alpha_t\big(R^{\pi}_{\phi}(s_t)+A_{\phi}(s_t,s'_t)\bar{z}_t-(B_{\phi}(s_t)+\eta I_0)z_t\big)$$
$$\bar{z}_{t+1} = \bar{z}_t+\beta_t(z_{t+1}-\bar{z}_t) \quad (A.41)$$

We use this criterion to show the boundedness of the critic parameter and the average reward estimator together. Let us write A.41 in the standard form of a stochastic approximation scheme:
$$z_{t+1} = z_t + \alpha_t\big(h(z_t,\bar{z}_t)+M^1_{t+1}\big)$$
Let $\bar{R}^{\pi}_{\phi} = \int_S d^{\pi}(s_t)R^{\pi}_{\phi}(s_t)\,ds_t$, $\bar{A}_{\phi} = \int_S d^{\pi}(s_t)\int_S P^{\pi}(s'_t|s_t)A_{\phi}(s_t,s'_t)\,ds'_t\,ds_t$, and $\bar{B}_{\phi} = \int_S d^{\pi}(s_t)B_{\phi}(s_t)\,ds_t$. Here,
$$h(z_t,\bar{z}_t) = \bar{R}^{\pi}_{\phi}+\bar{A}_{\phi}\bar{z}_t-(\bar{B}_{\phi}+\eta I_0)z_t$$
$$M^1_{t+1} = R^{\pi}_{\phi}(s_t)+A_{\phi}(s_t,s'_t)\bar{z}_t-(B_{\phi}(s_t)+\eta I_0)z_t-h(z_t,\bar{z}_t)$$
$$\bar{z}_{t+1} = \bar{z}_t+\beta_t\big(g(z_t,\bar{z}_t)+M^2_{t+1}+\epsilon(n)\big)$$
Here, $g(z_t,\bar{z}_t) = \lambda(z_t)-\bar{z}_t$.

Therefore, $g(z,\bar{z})$ is Lipschitz continuous with constant $L_g$. Using A.42 and A.43, condition A1 is satisfied.

Let us define an increasing sequence of $\sigma$-fields $\{\mathcal{F}_t\}$ by $\mathcal{F}_t = \sigma(z_m,\bar{z}_m,M^1_m,M^2_m,\ m\le t)$. Then $\{M^1_t\}$ and $\{M^2_t\}$ are martingale difference sequences.
$$\|M^1_{t+1}\|^2 = \big\|\big(R^{\pi}_{\phi}(s_t)-\bar{R}^{\pi}_{\phi}\big)+\big(A_{\phi}(s_t,s'_t)-\bar{A}_{\phi}\big)\bar{z}_t-\big(B_{\phi}(s_t)-\bar{B}_{\phi}\big)z_t\big\|^2$$
$$\le 3\big(\|R^{\pi}_{\phi}(s_t)-\bar{R}^{\pi}_{\phi}\|^2+\|A_{\phi}(s_t,s'_t)-\bar{A}_{\phi}\|^2\|\bar{z}_t\|^2+\|B_{\phi}(s_t)-\bar{B}_{\phi}\|^2\|z_t\|^2\big)$$
$$\le K_1\big(1+\|z_t\|^2+\|\bar{z}_t\|^2\big)$$
Here, $K_1 = 12\max\big(\sup_t\|R^{\pi}_{\phi}(s_t)\|^2,\sup_t\|A_{\phi}(s_t,s'_t)\|^2,\sup_t\|B_{\phi}(s_t)\|^2\big)$, which is finite by Assumptions 4 and 5. We have $\mathbb{E}[\|M^1_{t+1}\|^2\,|\,\mathcal{F}_t] \le K_1(1+\|z_t\|^2+\|\bar{z}_t\|^2)$ and $\mathbb{E}[\|M^2_{t+1}\|^2\,|\,\mathcal{F}_t] \le K_2(1+\|z_t\|^2+\|\bar{z}_t\|^2)$, where $K_2$ can be any positive constant. Hence condition A2 is satisfied.

Condition A3 requires $\sum_t\Big(\big(\frac{C_{\alpha}}{(1+t)^{\sigma}}\big)^2+\big(\frac{C_{\beta}}{(1+t)^{u}}\big)^2\Big) < \infty$. We can carefully set the values of $\sigma$ and $u$ to satisfy the conditions on the step sizes.
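The square-summability in condition A3 holds whenever $\sigma, u > 1/2$, since by the integral test $\sum_{t\ge 0}(1+t)^{-2\sigma} \le 1 + \frac{1}{2\sigma-1}$. A quick numerical illustration (the exponents $0.6$ and $0.8$ and the truncation length are arbitrary example choices):

```python
def stepsize_square_sums(sigma=0.6, u=0.8, T=100_000):
    """Partial sum of alpha_t^2 + beta_t^2 for alpha_t = (1+t)^(-sigma) and
    beta_t = (1+t)^(-u); finite whenever sigma, u > 1/2. By the integral
    test each series is bounded by 1 + 1/(2*exponent - 1)."""
    total = 0.0
    for t in range(T):
        total += (1 + t) ** (-2 * sigma) + (1 + t) ** (-2 * u)
    return total
```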

Algorithm 3 Off-policy AR-DPG with Linear FA
Initialize actor parameter $\theta$ and critic parameter $w$. Initialize actor target parameter $\bar{\theta} \leftarrow \theta$ and critic target parameter $\bar{w} \leftarrow w$. Initialize average reward parameter $\rho$ and target average reward parameter $\bar{\rho} \leftarrow \rho$. $\mu$ is the behavior policy. Initialize Replay Buffer = {}.
1: $t=0$, $s_0$ = env.reset()
2: while $t \le$ total steps do
3:   $a_t = \mu(s_t)+\epsilon$ {$\epsilon$ is the noise}
4:   $s_{t+1} \sim P(\cdot|s_t,a_t)$ and $r_t = R(s_t,a_t)$
5:   Store $\{s_t,a_t,s_{t+1}\}$ in the Replay Buffer
6:   Sample $B_t = \{s_i,a_i,s'_i\}_{i=0}^{M-1}$ from the Replay Buffer
7:   $w_{t+1} = \Gamma_{C_w}\big(w_t+\frac{\alpha_t}{M}\sum_{i=0}^{M-1}\big(R^{\mu}(s_i)-\bar{\rho}_t+\phi^{\pi}(s'_i)^{\top}\bar{w}_t-\phi^{\pi}(s_i)^{\top}w_t\big)\phi^{\pi}(s_i)-\alpha_t\eta w_t\big)$
8:   $\rho_{t+1} = \rho_t+\frac{\alpha_t}{M}\sum_{i=0}^{M-1}\big(R^{\mu}(s_i)-\rho_t+\phi^{\pi}(s'_i)^{\top}\bar{w}_t-\phi^{\pi}(s_i)^{\top}\bar{w}_t\big)$
9:   $\bar{w}_{t+1} = \bar{w}_t+\beta_t(w_{t+1}-\bar{w}_t)$
10:  $\bar{\rho}_{t+1} = \bar{\rho}_t+\beta_t(\rho_{t+1}-\bar{\rho}_t)$
11:  $\theta_{t+1} = \theta_t+\frac{\gamma_t}{M}\sum_{i=0}^{M-1}\nabla_aQ^w_{diff}(s_i,a)|_{a=\pi(s_i)}\nabla_{\theta}\pi(s_i)$
12:  $t = t+1$
13: end while



Proof. $\eta$ is the l2-regularisation coefficient from Algorithm 3 and $\eta > \chi^{all}_{max}$, where $\chi^{all}_{max}$ is defined in Lemma 16. Because of this careful setting of $\eta$, $A(\theta_t)$ is negative definite. Thus, for off-policy TD(0) with l2-regularisation and target estimators, the following condition holds for the optimal critic parameter $w^*_t$:
$$A(\theta_t)w^*_t + b(\theta_t) = 0 \quad (A.30)$$
The relevant expectations are with respect to the stationary state distribution $d^{\mu}(\cdot)$ of the policy $\mu$. Please note the abuse of notation here: $A(\theta_t)$ is the same as $A^{\mu}_{off}(\theta_t)$ of Lemma 16. The proof then proceeds from equations A.30 and A.31, where $\pi'$ and $\pi$ represent the policies with parameters $\theta_{t+1}$ and $\theta_t$ respectively and $\mu$ is the behaviour policy, and uses equations A.33 and A.34 in equation A.32 together with equation A.30.

Condition A5: the ODE $\dot{z}(t) = g_{\infty}(z(t))$ has the origin as its unique globally asymptotically stable equilibrium if $I-(\bar{B}_{\phi}+\eta I_0)^{-1}\bar{A}_{\phi}$ is a positive definite matrix. ($\|\cdot\|$ refers to the L2-norm; $\lambda_i$ are the eigenvalues of the matrix $C_{\phi}$.) Under this assumption, $I-(\bar{B}_{\phi}+\eta I_0)^{-1}\bar{A}_{\phi}$ is positive definite, and therefore the ODE $\dot{z}(t) = g_{\infty}(z(t))$ has the origin as its unique globally asymptotically stable equilibrium. Condition A5 is satisfied.

Finally, consider the ODE $\dot{z}(t) = h(z(t),\bar{z})$, where $h(z(t),\bar{z}) = \bar{R}^{\pi}_{\phi}+\bar{A}_{\phi}\bar{z}-(\bar{B}_{\phi}+\eta I_0)z(t)$. As earlier, for $\eta+\lambda_{min}(C_{\phi})>0$, $\bar{B}_{\phi}+\eta I_0$ is a positive definite matrix. Therefore, the ODE $\dot{z}(t) := h(z(t),\bar{z})$ has a unique globally asymptotically stable equilibrium point $\lambda(\bar{z}) = (\bar{B}_{\phi}+\eta I_0)^{-1}(\bar{R}^{\pi}_{\phi}+\bar{A}_{\phi}\bar{z})$. Conditions A1 to A5 are satisfied; therefore $\sup_t\|z_t\|<\infty$, which implies the iterates are bounded. Hence the critic parameter $w_t$ is bounded.
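As a sanity check on this conclusion, the unprojected updates of A.39 can be simulated on synthetic data: with zero-mean random features and l2 regularisation, the iterates stay bounded and the average reward estimate settles near the mean reward. This is only an illustrative simulation with made-up features, rewards, and step-size constants, not the paper's experimental setup:

```python
import numpy as np

def td0_two_timescale(T=500, k=3, eta=0.1, seed=0):
    """Run the unprojected two-timescale updates of A.39 on synthetic data
    (zero-mean Gaussian features, rewards with mean 1). The l2-regularised
    critic stays bounded and rho tracks the mean reward."""
    rng = np.random.default_rng(seed)
    w, w_bar = np.zeros(k), np.zeros(k)
    rho, rho_bar = 0.0, 0.0
    for t in range(T):
        alpha = 0.2 / (1 + t) ** 0.6     # fast step size alpha_t
        beta = 0.2 / (1 + t) ** 0.8      # slow (target) step size beta_t
        phi_s = rng.normal(size=k)       # phi(s_t), synthetic features
        phi_s2 = rng.normal(size=k)      # phi(s'_t)
        r = 1.0 + 0.1 * rng.normal()     # synthetic reward R(s_t)
        delta = r - rho_bar + phi_s2 @ w_bar - phi_s @ w
        w = w + alpha * delta * phi_s - alpha * eta * w           # critic update
        rho = rho + alpha * (r - rho + (phi_s2 - phi_s) @ w_bar)  # avg-reward update
        w_bar = w_bar + beta * (w - w_bar)                        # Polyak targets
        rho_bar = rho_bar + beta * (rho - rho_bar)
    return w, rho, w_bar, rho_bar
```

Even without the projection $\Gamma_{C_w}$, the regularisation term $-\alpha_t\eta w_t$ keeps the critic iterates from drifting, mirroring the role $\eta$ plays in the proof above.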

B ALGORITHM AND HYPERPARAMETERS

B.1 (OFF-POLICY) ARO-DDPG PRACTICAL ALGORITHM

Algorithm 1 (Off-Policy) ARO-DDPG Practical Algorithm
Initialize actor parameter $\theta$ and critic parameters $w_1$, $w_2$. Initialize actor target parameter $\bar{\theta} \leftarrow \theta$. Initialize critic target parameters $\bar{w}_1 \leftarrow w_1$, $\bar{w}_2 \leftarrow w_2$. Initialize average reward parameter $\rho$. Initialize target average reward parameter $\bar{\rho} \leftarrow \rho$. Initialize Replay Buffer = {}.
1: $t=0$, $s_0$ = env.reset()
2: while $t \le$ total steps do
3:

