AVERAGE REWARD REINFORCEMENT LEARNING WITH MONOTONIC POLICY IMPROVEMENT

Anonymous

Abstract

In continuing control tasks, an agent's average reward per time step is a more natural performance measure than the commonly used discounted return, since it better captures an agent's long-term behavior. We derive a novel lower bound on the difference between the long-term average rewards of two policies. The lower bound depends on the average divergence between the policies and on the so-called Kemeny constant, which measures to what degree the unichain Markov chains associated with the policies are well connected. We also show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) yields a non-meaningful lower bound in the average reward setting. Based on our lower bound, we develop an iterative procedure that produces a sequence of monotonically improving policies for the average reward criterion. When combined with Deep Reinforcement Learning (DRL) methods, the procedure leads to scalable and efficient algorithms for maximizing an agent's average reward performance. Empirically, we demonstrate the effectiveness of our method on continuing control tasks and show how discounting can lead to unsatisfactory performance.

1. INTRODUCTION

The goal of Reinforcement Learning (RL) is to build agents that can learn high-performing behaviors through trial-and-error interactions with the environment. Broadly speaking, modern RL tackles two kinds of problems: episodic tasks and continuing tasks. In episodic tasks, the agent-environment interaction can be broken into distinct episodes, and the performance of the agent is simply the sum of the rewards accrued within an episode. Examples of episodic tasks include training an agent to play Go (Silver et al., 2016; 2018) or Atari video games (Mnih et al., 2013), where the episode terminates when the game ends. In continuing tasks, such as controlling robots with long operating lifespans (Peters & Schaal, 2008; Schulman et al., 2015; Haarnoja et al., 2018), there is no natural separation into episodes and the agent-environment interaction continues indefinitely. The performance of an agent in a continuing task is more difficult to quantify since, even for bounded reward functions, the total sum of rewards is typically infinite. One way of making the long-term reward objective meaningful for continuing tasks is to apply discounting, i.e., to maximize the discounted sum of rewards r_0 + γr_1 + γ^2 r_2 + ⋯ for some discount factor γ ∈ (0, 1), which is guaranteed to be finite for any bounded reward function. However, the discounted objective biases the optimal policy toward actions that yield high near-term performance rather than high long-term performance. Such an objective, while useful in certain applications, is not appropriate when the goal is to optimize long-term behavior. As argued in Chapter 10 of Sutton & Barto (2018) and in Naik et al. (2019), a more natural objective is the average reward received by the agent per time step.
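To make the near-term bias of discounting concrete, the following sketch (an illustrative example, not taken from the paper; the two reward streams are hypothetical) compares a policy that earns reward 1 on every step against a policy that earns nothing for 10 steps and then reward 2 forever. The second policy has twice the long-run average reward, yet a small discount factor ranks it below the first.

```python
# Illustrative example: discounting can prefer a myopically better
# policy over one with a higher long-run average reward.

def discounted_return(rewards, gamma, horizon=10_000):
    """Approximate the infinite discounted sum by truncating at `horizon`."""
    return sum(gamma**t * r for t, r in enumerate(rewards(horizon)))

def average_reward(rewards, horizon=10_000):
    """Empirical average reward per time step over `horizon` steps."""
    rs = rewards(horizon)
    return sum(rs) / horizon

# Policy A: reward 1 on every step (average reward 1).
policy_a = lambda T: [1.0] * T
# Policy B: reward 0 for the first 10 steps, then 2 forever (average reward ~2).
policy_b = lambda T: [0.0] * 10 + [2.0] * (T - 10)

for gamma in (0.9, 0.99):
    print(gamma,
          discounted_return(policy_a, gamma),
          discounted_return(policy_b, gamma))
```

With γ = 0.9 the discounted criterion prefers policy A (roughly 10 versus 7), even though policy B accrues twice the average reward; with γ = 0.99 the ranking flips. The average reward criterion prefers policy B regardless of any discount parameter.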
While the average reward setting has been extensively studied in the classical Markov Decision Process literature (Howard, 1960; Blackwell, 1962; Veinott, 1966; Bertsekas et al., 1995), it is much less commonly used in reinforcement learning. An important open question is whether recent advances in RL for the discounted reward criterion can be naturally generalized to the average reward setting. One major source of difficulty with modern DRL algorithms lies in controlling the step size for policy updates. In order to gain better control over step sizes, Schulman et al. (2015) constructed a lower bound on the difference between the expected discounted returns of two arbitrary policies π and π′. The bound is a function of the divergence between the two policies and the discount factor.
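For reference, the discounted-return bound of Schulman et al. (2015) can be restated (in their notation, as a sketch; see the original paper for the precise statement) as:

```latex
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; \frac{4\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2,
\qquad
\alpha = D_{\mathrm{TV}}^{\max}(\pi, \tilde{\pi}),
\qquad
\epsilon = \max_{s,a} \left| A_{\pi}(s,a) \right|,
```

where η denotes the expected discounted return, L_π the local surrogate objective, and A_π the advantage function. Note the (1-γ)^{-2} factor in the penalty term: as γ → 1, which is the regime relevant to long-term behavior, the penalty diverges, which previews why bounds of this form become non-meaningful in the average reward setting.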

