OFF-POLICY AVERAGE REWARD ACTOR-CRITIC WITH DETERMINISTIC POLICY SEARCH

Abstract

The average reward criterion is relatively less explored, as most existing works in the Reinforcement Learning literature consider the discounted reward criterion. A few recent works present on-policy average reward actor-critic algorithms, but the off-policy average reward actor-critic setting remains relatively unexplored. In this paper, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. We provide a finite-time analysis of the resulting three-timescale stochastic approximation scheme with a linear function approximator and obtain an ϵ-optimal stationary policy with a sample complexity of Ω(ϵ^{-2.5}). We compare the average reward performance of our proposed algorithm with state-of-the-art on-policy average reward actor-critic algorithms on MuJoCo-based environments and observe better empirical performance.

1. INTRODUCTION

The reinforcement learning (RL) paradigm has shown significant promise for finding solutions to decision making problems that rely on reward-based feedback from the environment. Here one is mostly concerned with the long-term reward acquired by the algorithm. In the case of infinite horizon problems, the discounted reward criterion has largely been studied because of its simplicity. Major recent developments in the context of RL in continuous state-action spaces have considered the discounted reward criterion (Schulman et al., 2015; 2017; Lillicrap et al., 2016; Haarnoja et al., 2018). However, there are very few works that focus on the average reward performance criterion in the continuous state-action setting (Zhang & Ross, 2021; Ma et al., 2021). The average reward criterion has started receiving attention in recent times, and there are papers that discuss the benefits of using this criterion over the discounted reward (Dewanto & Gallagher, 2021; Naik et al., 2019). One reason is that the average reward criterion considers only recurrent states, and it happens to be the most selective optimization criterion for recurrent Markov Decision Processes (MDPs) according to the n-discount optimality criterion; please refer to Mahadevan (1996) for more details on this criterion. Further, optimization in the average reward setting does not depend on the initial state distribution. Moreover, the discrepancy between the objective function and the evaluation metric that exists in the discounted reward setting is resolved by opting for the average reward criterion. We encourage the readers to go through Dewanto & Gallagher (2021); Naik et al. (2019) for a better understanding of these benefits. There are very few algorithms in the literature that optimize the average reward, and all of them happen to be on-policy algorithms (Zhang & Ross, 2021; Ma et al., 2021).
It has been demonstrated several times that on-policy algorithms are less sample efficient than off-policy algorithms (Lillicrap et al., 2016; Haarnoja et al., 2018; Fujimoto et al., 2018) for the discounted reward criterion. In this paper, we investigate whether the same holds for the average reward criterion. We address the research gap in the development of off-policy average reward algorithms for continuous state and action spaces by proposing an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. The policy evaluation step in the average reward setting is equivalent to finding a solution to the Poisson equation (i.e., the Bellman equation for a given policy). The Poisson equation, because of its form, does not admit a unique solution but only solutions that are unique up to an additive constant. Further, the policy evaluation step here consists of estimating not just the differential Q-value function but also the average reward. Because two quantities must be estimated instead of one, the roles of the optimization algorithm and the target network become more important. We therefore implement the proposed ARO-DDPG algorithm using a target network and a carefully selected optimization algorithm. The following are the broad contributions of our paper:
• We provide both on-policy and off-policy deterministic policy gradient theorems for the average reward performance metric.
• We present our Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm.
• We perform a non-asymptotic convergence analysis and provide a finite-time analysis of our three-timescale stochastic approximation based actor-critic algorithm using a linear function approximator.
• We empirically compare our algorithm with other state-of-the-art algorithms in the literature.
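The policy evaluation and improvement steps described above can be sketched as a single off-policy update. The following is a minimal, illustrative sketch with a linear critic (as in our finite-time analysis) rather than the deep networks and target network of the full ARO-DDPG implementation; the feature map, step sizes, dynamics, and quadratic cost are made-up choices, not the paper's exact configuration. It shows the three quantities updated on three timescales: the average reward estimate, the differential Q-value critic, and the deterministic actor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions of a made-up continuous MDP (illustrative only).
n_s, n_a = 3, 2
d = n_s + n_a                 # feature dimension of the linear critic

def phi(s, a):
    """Linear critic features: differential Q_w(s, a) = w @ phi(s, a)."""
    return np.concatenate([s, a])

# Parameters: deterministic actor a = K @ s, critic weights w,
# and average reward estimate rho.
K = np.zeros((n_a, n_s))
w = np.zeros(d)
rho = 0.0

# Three step sizes on three timescales (fastest to slowest) -- illustrative.
a_rho, a_w, a_K = 0.05, 0.01, 0.001

def update(s, a, r, s2):
    """One off-policy update from a behaviour-policy transition (s, a, r, s2)."""
    global K, w, rho
    a2 = K @ s2                                   # next action from current actor
    # TD error of the Poisson equation: Q(s,a) = r - rho + E[Q(s', pi(s'))]
    delta = r - rho + w @ phi(s2, a2) - w @ phi(s, a)
    rho += a_rho * (r - rho)                      # average reward tracker
    w += a_w * delta * phi(s, a)                  # critic: semi-gradient TD
    # Actor: deterministic policy gradient; for a linear critic,
    # grad_a Q_w(s, a) is simply the action part of w.
    grad_a_Q = w[n_s:]
    K += a_K * np.outer(grad_a_Q, s)              # chain rule through a = K @ s

# Illustrative driver: i.i.d. replay-style samples from a noisy behaviour
# policy and made-up contracting dynamics with a quadratic cost.
for _ in range(200):
    s = rng.normal(size=n_s)
    a = K @ s + 0.1 * rng.normal(size=n_a)        # behaviour = actor + noise
    r = -float(s @ s)
    s2 = 0.5 * s + 0.1 * rng.normal(size=n_s)
    update(s, a, r, s2)
```

The separation of step sizes (a_rho ≫ a_w ≫ a_K) is what makes this a three-timescale scheme: the average reward tracker equilibrates fastest, the critic tracks the resulting Poisson equation, and the actor moves slowest.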
The rest of the paper is structured as follows: In Section 2, we present preliminaries on the MDP framework and the basic setting, as well as the policy gradient algorithm. Section 3 presents the deterministic policy gradient theorems and our algorithm. Section 4 then presents the main theoretical results related to the finite-time analysis. Section 5 presents the experimental results. In Section 6, we discuss other related work, and Section 7 presents the conclusions. The detailed proofs for the finite-time analysis are available in the Appendix. Assumption 1, stated in the next section, is needed to ensure the existence of a steady-state distribution of the Markov process.

2. PRELIMINARIES

Consider a Markov Decision Process (MDP) M = {S, A, R, P, π}, where S ⊂ R^n is the (continuous) state space, A ⊂ R^m is the (continuous) action space, and R : S × A → R denotes the reward function, with R(s, a) being the reward obtained under state s and action a. Further, P(·|s, a) denotes the state transition function defined as P : S × A → µ(·), where µ : B(S) → [0, 1] is a probability measure and B(S) represents the Borel sigma algebra on S. A deterministic policy π is defined as π : S → A. A stochastic policy π_r is defined as π_r : S → µ′(·), where µ′ : B(A) → [0, 1] and B(A) is the Borel sigma algebra on A.

Assumption 1. The Markov process obtained under any policy π is ergodic.

2.1. DISCOUNTED REWARD MDPS

In discounted reward MDPs, discounting is controlled by γ ∈ (0, 1). The following performance metric is optimized with respect to the policy:

η(π) = E_π[Σ_{t=0}^∞ γ^t R(s_t, a_t)] = ∫_S ρ_0(s) V^π(s) ds,   (1)

where ρ_0 is the initial state distribution and V^π is the value function. V^π(s) denotes the long-term reward acquired when starting in state s, and it satisfies

V^π(s_t) = E_π[R(s_t, a_t) + γ V^π(s_{t+1}) | s_t].   (2)

2.2. AVERAGE REWARD MDPS

The performance metric in the case of average reward MDPs is the long-run average reward ρ(π), defined as follows:

ρ(π) = lim_{N→∞} (1/N) E_π[Σ_{t=0}^{N-1} R(s_t, a_t)] = ∫_S d^π(s) R^π(s) ds,

where d^π is the stationary state distribution of the Markov process induced by π, and R^π(s) denotes the expected reward in state s under π (for a deterministic policy, R^π(s) = R(s, π(s))).
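To make the two performance criteria above concrete, the following sketch computes both for a made-up two-state Markov chain under a fixed policy (the transition matrix and rewards are purely illustrative). The discounted value solves the linear system V = r + γPV from equation (2), while the average reward is the stationary distribution weighted by the per-state reward, as in the definition of ρ(π).

```python
import numpy as np

# A made-up 2-state Markov chain under a fixed policy pi:
# transition matrix P and per-state expected reward r (illustrative numbers).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.99

# Discounted criterion: V^pi solves V = r + gamma * P @ V.
V = np.linalg.solve(np.eye(2) - gamma * P, r)

# Average reward criterion: rho = d_pi @ r, where d_pi is the stationary
# distribution (left eigenvector of P for eigenvalue 1, normalised to sum 1).
evals, evecs = np.linalg.eig(P.T)
d_pi = np.real(evecs[:, np.argmax(np.real(evals))])
d_pi /= d_pi.sum()
rho = d_pi @ r

# Unlike V, rho is a single number independent of the start state; for an
# ergodic chain, (1 - gamma) * V(s) approaches rho for every s as gamma -> 1.
```

For this chain the stationary distribution is d^π = (2/3, 1/3), so ρ(π) = 2/3, while V^π depends on the starting state, illustrating the start-state independence of the average reward criterion noted in the introduction.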

