AVERAGE REWARD REINFORCEMENT LEARNING WITH MONOTONIC POLICY IMPROVEMENT

Anonymous

Abstract

In continuing control tasks, an agent's average reward per time step is a more natural performance measure than the commonly used discounted return, since it better captures the agent's long-term behavior. We derive a novel lower bound on the difference between the long-term average rewards of two policies. The lower bound depends on the average divergence between the policies and on the so-called Kemeny constant, which measures how well-connected the unichain Markov chains associated with the policies are. We also show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) yields a vacuous lower bound in the average reward setting. Based on our lower bound, we develop an iterative procedure which produces a sequence of monotonically improving policies under the average reward criterion. When combined with Deep Reinforcement Learning (DRL) methods, the procedure leads to scalable and efficient algorithms for maximizing the agent's average reward performance. Empirically, we demonstrate the effectiveness of our method on continuing control tasks and show how discounting can lead to unsatisfactory performance.

1. INTRODUCTION

The goal of Reinforcement Learning (RL) is to build agents that can learn high-performing behaviors through trial-and-error interactions with the environment. Broadly speaking, modern RL tackles two kinds of problems: episodic tasks and continuing tasks. In episodic tasks, the agent-environment interaction breaks into distinct episodes, and the performance of the agent is simply the sum of the rewards accrued within an episode. Examples of episodic tasks include training an agent to play Go (Silver et al., 2016; 2018) or Atari video games (Mnih et al., 2013), where the episode terminates when the game ends. In continuing tasks, such as controlling robots with long operating lifespans (Peters & Schaal, 2008; Schulman et al., 2015; Haarnoja et al., 2018), there is no natural separation into episodes and the agent-environment interaction continues indefinitely. The performance of an agent in a continuing task is more difficult to quantify since, even for bounded reward functions, the total sum of rewards is typically infinite. One way of making the long-term reward objective meaningful for continuing tasks is to apply discounting, i.e., to maximize the discounted sum of rewards $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$ for some discount factor γ ∈ (0, 1), which is guaranteed to be finite for any bounded reward function. However, the discounted objective biases the optimal policy toward actions that yield high near-term rather than high long-term performance. Such an objective, while useful in certain applications, is not appropriate when the goal is to optimize long-term behavior. As argued in Chapter 10 of Sutton & Barto (2018) and in Naik et al. (2019), a more natural objective is the average reward received by the agent per time step.
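As a concrete illustration of this bias (a toy example, not taken from the paper), consider two reward streams: one pays 1 per step forever, while the other pays nothing for 50 steps and then 2 per step thereafter. The average reward criterion prefers the second stream, whereas a discounted criterion with γ = 0.9 prefers the first:

```python
# Toy illustration of the discounting bias (not from the paper):
# stream A pays 1 every step; stream B pays 0 for 50 steps, then 2 forever.

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the given (finite) reward stream."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Empirical average reward over the horizon."""
    return sum(rewards) / len(rewards)

H = 10_000                                # long horizon approximating a continuing task
stream_a = [1.0] * H                      # myopic: good now, mediocre forever
stream_b = [0.0] * 50 + [2.0] * (H - 50)  # far-sighted: poor start, better forever

gamma = 0.9
print(discounted_return(stream_a, gamma) > discounted_return(stream_b, gamma))  # -> True
print(average_reward(stream_b) > average_reward(stream_a))                      # -> True
```

With γ = 0.9 the second stream's payoff is suppressed by a factor of roughly γ⁵⁰ ≈ 0.005, so discounting ranks the myopic stream higher even though its long-run average reward is half as large.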
While the average reward setting has been extensively studied in the classical Markov Decision Process literature (Howard, 1960; Blackwell, 1962; Veinott, 1966; Bertsekas et al., 1995), it is much less commonly used in reinforcement learning. An important open question is whether recent advances in RL for the discounted reward criterion can be naturally generalized to the average reward setting. One major source of difficulty with modern DRL algorithms lies in controlling the step-size for policy updates. In order to gain better control over step-sizes, Schulman et al. (2015) constructed a lower bound on the difference between the expected discounted returns of two arbitrary policies π and π′. The bound is a function of the divergence between the two policies and the discount factor. Schulman et al. (2015) showed that iteratively maximizing this lower bound generates a sequence of policies that monotonically improve in terms of their discounted return. In this paper, we first show that the policy improvement theorem from Schulman et al. (2015) yields a vacuous bound in the average reward case. We then derive a novel result which lower bounds the difference of the average rewards in terms of the divergence between the policies. The bound depends on the average divergence between the policies and on the so-called Kemeny constant, which measures how well-connected the unichain Markov chains associated with the policies are. We show that iteratively maximizing this lower bound guarantees monotonic average reward policy improvement. As in the discounted case, the problem of maximizing the lower bound can be approximated with DRL algorithms that can be optimized using samples collected from the environment. We describe in detail two such algorithms: Average Reward TRPO (ATRPO) and Average Cost CPO (ACPO), which are average reward versions of algorithms based on the discounted criterion (Schulman et al., 2015; Achiam et al., 2017).
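The Kemeny constant mentioned above can be computed directly for any small ergodic transition matrix. The sketch below (illustrative, not the paper's code) uses the standard identity that the constant equals the trace of the fundamental matrix Z = (I − P + 1πᵀ)⁻¹, under the convention that includes the expected return-time term:

```python
import numpy as np

# Illustrative sketch (not the paper's code): Kemeny's constant of an ergodic
# transition matrix P, via the fundamental matrix Z = (I - P + 1 pi^T)^{-1}.
# With the convention that includes the expected return-time term,
# K = trace(Z) = 1 + sum over non-unit eigenvalues of 1 / (1 - lambda_i).

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a distribution."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

def kemeny_constant(P):
    """Trace of the fundamental matrix of the chain P."""
    n = P.shape[0]
    pi = stationary_distribution(P)
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))
    return float(np.trace(Z))

# Fast-mixing 2-state chain with switching probabilities a = b = 0.5:
# K = 1 + 1/(a + b) = 2.
fast = np.array([[0.5, 0.5], [0.5, 0.5]])
# Slow-mixing chain (a = b = 0.01): a much larger constant.
slow = np.array([[0.99, 0.01], [0.01, 0.99]])
print(round(kemeny_constant(fast), 6))  # -> 2.0
print(round(kemeny_constant(slow), 6))  # -> 51.0
```

Intuitively, a small Kemeny constant indicates a well-connected, fast-mixing chain, while a large one indicates states that are slow to reach from one another.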
Using the MuJoCo simulated robotic benchmark, we carry out extensive experiments with the ATRPO algorithm and show that it is more effective than its discounted counterpart on these continuing control tasks. To our knowledge, this is one of the first papers to address DRL under the long-term average reward criterion.

2. PRELIMINARIES

Consider a Markov Decision Process (MDP) (S, A, P, r, µ) (Sutton & Barto, 2018), where the state space S and the action space A are assumed to be finite. The transition probability function is P : S × A × S → [0, 1], the reward function r : S × A → [r_min, r_max] is bounded, and µ : S → [0, 1] is the initial state distribution. Let π = {π(a|s) : s ∈ S, a ∈ A} denote a stationary policy, and let Π be the set of all stationary policies. Below we discuss the two objective formulations for continuing control tasks: the average reward approach and the discounted reward approach.

Average Reward Approach

In this paper, we focus exclusively on unichain MDPs, in which the Markov chain corresponding to every policy contains a single recurrent class and a finite, possibly empty, set of transient states. The average reward objective is defined as

$$\rho(\pi) := \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{N-1} r(s_t, a_t)\right] = \mathbb{E}_{s \sim d_\pi,\, a \sim \pi}[r(s, a)].$$

Here $d_\pi(s) := \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P(s_t = s \mid \pi) = \lim_{t \to \infty} P(s_t = s \mid \pi)$ is the stationary state distribution under policy π, and $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a sample trajectory. We write τ ∼ π to indicate that the trajectory is sampled from policy π, i.e., $s_0 \sim \mu$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. In the unichain case, the average reward ρ(π) is independent of the initial state for any policy π (Bertsekas et al., 1995). The average-reward value function is

$$V^\pi(s) := \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \big(r(s_t, a_t) - \rho(\pi)\big) \,\Big|\, s_0 = s\right],$$

the action-value function is

$$Q^\pi(s, a) := \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \big(r(s_t, a_t) - \rho(\pi)\big) \,\Big|\, s_0 = s,\, a_0 = a\right],$$

and the average reward advantage function is $A^\pi(s, a) := Q^\pi(s, a) - V^\pi(s)$.
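For small tabular MDPs, the quantities defined above can be computed exactly rather than estimated from trajectories. The following sketch (with made-up transition probabilities and rewards) forms the chain induced by a fixed policy and evaluates ρ(π) as the stationary expectation of the one-step reward:

```python
import numpy as np

# Minimal sketch (MDP values invented for illustration): evaluate the average
# reward rho(pi) = E_{s ~ d_pi, a ~ pi}[r(s, a)] for a 2-state, 2-action MDP.

n_states, n_actions = 2, 2
# P[s, a, s'] : transition probabilities; r[s, a] : bounded rewards.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]; P[1, 1] = [0.1, 0.9]
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],    # pi(a|s): a uniform stationary policy
               [0.5, 0.5]])

# Chain induced by pi: P_pi[s, s'] = sum_a pi(a|s) P[s, a, s'].
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sa->s', pi, r)   # expected one-step reward per state

# Stationary distribution d_pi: left eigenvector of P_pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi /= d_pi.sum()

rho = d_pi @ r_pi  # the average reward, independent of the start state
print(round(float(rho), 4))  # -> 0.8971
```

Because the MDP here is unichain (in fact ergodic), ρ does not depend on µ; the same value would be obtained from the empirical average reward along any sufficiently long trajectory under π.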

Discounted Reward Approach

For some discount factor γ ∈ (0, 1), the discounted reward objective is defined as

$$\rho_\gamma(\pi) := \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d_{\pi,\gamma},\, a \sim \pi}[r(s, a)],$$

where $d_{\pi,\gamma}(s) := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$ is known as the future discounted state visitation distribution under policy π. Note that, unlike the average reward objective, the discounted objective depends on the initial state distribution µ. It can be easily shown that $d_{\pi,\gamma}(s) \to d_\pi(s)$ for all s as γ → 1. The discounted value function is defined as

$$V^\pi_\gamma(s) := \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\right],$$

the discounted action-value function as

$$Q^\pi_\gamma(s, a) := \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s,\, a_0 = a\right],$$

and the discounted advantage function as $A^\pi_\gamma(s, a) := Q^\pi_\gamma(s, a) - V^\pi_\gamma(s)$.
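The convergence of the discounted visitation distribution to the stationary one can be checked numerically on a small chain. The sketch below (chain and rewards invented for illustration) computes $d_{\pi,\gamma}$ in closed form as $(1 - \gamma)\, \mu^\top (I - \gamma P_\pi)^{-1}$ and compares it against $d_\pi$ for increasing γ:

```python
import numpy as np

# Sketch (values invented for illustration): the discounted visitation
# distribution d_{pi,gamma} = (1 - gamma) mu^T (I - gamma P_pi)^{-1}
# approaches the stationary distribution d_pi as gamma -> 1.

P_pi = np.array([[0.55, 0.45],
                 [0.40, 0.60]])   # policy-induced transition matrix
r_pi = np.array([0.5, 1.25])      # expected one-step rewards under the policy
mu = np.array([1.0, 0.0])         # initial state distribution

def d_gamma(gamma):
    """Closed form of the future discounted state visitation distribution."""
    return (1 - gamma) * mu @ np.linalg.inv(np.eye(2) - gamma * P_pi)

# Stationary distribution of P_pi for comparison.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi /= d_pi.sum()

for gamma in (0.9, 0.99, 0.999):
    print(gamma, np.round(d_gamma(gamma), 4))
print('d_pi :', np.round(d_pi, 4))
# Likewise (1 - gamma) * rho_gamma(pi) = d_gamma . r_pi approaches
# rho(pi) = d_pi . r_pi, linking the two objectives in the gamma -> 1 limit.
```

Note that for γ well below 1 the visitation distribution stays tilted toward states near the support of µ, which is exactly the dependence on the initial distribution that the average reward objective avoids.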

