BEYOND PRIORITIZED REPLAY: SAMPLING STATES IN MODEL-BASED RL VIA SIMULATED PRIORITIES

Abstract

Prioritized Experience Replay (ER) has attracted great attention; however, there is little theoretical understanding of such prioritization strategies and why they help. In this work, we revisit prioritized ER and, in an ideal setting, show an equivalence to minimizing a cubic loss, providing theoretical insight into why it improves upon uniform sampling. This theoretical equivalence highlights two limitations of current prioritized experience replay methods: insufficient coverage of the sample space and outdated priorities of training samples. This motivates our model-based approach, which does not suffer from these limitations. Our key idea is to actively search for high-priority states using gradient ascent. Under certain conditions, we prove that the hypothetical experiences generated from these states are sampled approximately in proportion to their true priorities. We also characterize the distance between the sampling distribution of our method and the true prioritized sampling distribution. Our experiments on both benchmark and application-oriented domains show that our approach achieves superior performance over baselines.

1. INTRODUCTION

Using hypothetical experience simulated from an environment model can significantly improve the sample efficiency of RL agents (Ha & Schmidhuber, 2018; Holland et al., 2018; Pan et al., 2018; Janner et al., 2019; van Hasselt et al., 2019). Dyna (Sutton, 1991) is a classical MBRL architecture in which the agent uses real experience to update its policy as well as its reward and dynamics models. In between taking actions, the agent can get hypothetical experience from the model to further improve the policy. An important question for effective Dyna-style planning is search-control: from what states should the agent simulate hypothetical transitions? On each planning step in Dyna, the agent has to select a state and action from which to query the model for the next state and reward. This question, in fact, already arises in what is arguably the simplest variant of Dyna: Experience Replay (ER) (Lin, 1992). In ER, visited transitions are stored in a buffer and, at each time step, a mini-batch of experiences is sampled to update the value function. ER can be seen as an instance of Dyna, using a (limited) non-parametric model given by the buffer (see van Seijen & Sutton (2015) for a deeper discussion). Performance can be significantly improved by sampling proportionally to priorities based on errors, as in prioritized ER (Schaul et al., 2016; de Bruin et al., 2018), as well as by specialized sampling for the off-policy setting (Schlegel et al., 2019). Search-control strategies in Dyna similarly often rely on priorities, though they can be more flexible in leveraging the model rather than being limited to retrieving visited experiences. For example, a model enables the agent to sweep backwards by generating predecessors, as in prioritized sweeping (Moore & Atkeson, 1993; Sutton et al., 2008; Pan et al., 2018; Corneil et al., 2018).
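As a concrete illustration of error-based prioritization, a minimal prioritized replay buffer can be sketched as follows. This is a simplified version of the scheme in Schaul et al. (2016): the class name `PrioritizedBuffer`, the exponent `alpha`, and the linear-scan sampling are our own simplifications, and practical implementations use a sum-tree for efficient sampling.

```python
import numpy as np

class PrioritizedBuffer:
    """Minimal prioritized ER sketch: sample transitions with
    probability proportional to |TD-error|^alpha."""
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:      # drop oldest when full
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        # small epsilon keeps zero-error transitions sampleable
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        p = p / p.sum()                          # normalize to a distribution
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # without this refresh, stored priorities go stale after updates
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + 1e-6) ** self.alpha
```

Note that priorities are only refreshed for the transitions actually sampled, which is exactly why the rest of the buffer drifts out of date, one of the two limitations discussed above.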
Other methods have tried alternatives to error-based prioritization, such as searching for states with high reward (Goyal et al., 2019), high value (Pan et al., 2019), or states that are difficult to learn (Pan et al., 2020). Another strategy is to directly generate hypothetical experiences from trajectory optimization algorithms (Gu et al., 2016). These methods are all supported by appealing intuition, but as yet lack solid theoretical reasons for why they can improve sample efficiency. In this work, we provide new insights about how to choose the sampling distribution over states from which we generate hypothetical experience. In particular, we theoretically motivate why error-based prioritization is effective, and provide a mechanism to generate states according to more accurate error estimates. We first prove that ℓ2 regression with error-based prioritized sampling is equivalent to minimizing a cubic objective with uniform sampling in an idealized setting. We then show that minimizing the cubic-power objective has a faster convergence rate during the early stage of learning, providing theoretical motivation for error-based prioritization. This theoretical understanding illuminates two issues of prioritized ER: insufficient sample-space coverage and outdated priorities. To overcome these limitations, we propose a search-control strategy in Dyna that leverages a model to simulate errors and to find states with high expected error. Finally, we demonstrate the efficacy of our method on several benchmark domains and an autonomous driving application.
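The claimed equivalence can be checked numerically in the idealized setting. For a linear model f_θ(x) = θᵀx with per-sample errors δ_i = f_θ(x_i) − y_i, the expected prioritized ℓ2 gradient Σ_i p_i δ_i ∇_θ δ_i with p_i ∝ |δ_i| coincides, up to a constant factor, with the uniform-sampling gradient of the cubic objective (1/3)|δ_i|³. The sketch below is a toy check of this identity, not the paper's full proof; the data, the model, and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))        # inputs x_i
y = rng.normal(size=n)             # targets y_i
theta = rng.normal(size=d)         # current parameters

delta = X @ theta - y                       # per-sample errors delta_i
p = np.abs(delta) / np.abs(delta).sum()     # priorities p_i ∝ |delta_i|

# Expected gradient of (1/2) delta^2 under prioritized sampling:
g_prioritized = (p * delta) @ X
# Gradient of (1/3)|delta|^3 averaged uniformly over the samples
# (d/dtheta (1/3)|delta|^3 = |delta| * delta * x):
g_cubic = (np.abs(delta) * delta) @ X / n

# The two gradients agree up to the constant sum_i |delta_i| / n
scale = np.abs(delta).sum() / n
assert np.allclose(g_prioritized * scale, g_cubic)
```

The constant factor only rescales the step size, so in this idealized setting the two procedures follow the same descent direction.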

2. PROBLEM FORMULATION

We formalize the problem as a Markov Decision Process (MDP), a tuple (S, A, P, R, γ) including state space S, action space A, probability transition kernel P, reward function R, and discount rate γ ∈ [0, 1]. At each environment time step t, an RL agent observes a state s_t ∈ S and takes an action a_t ∈ A. The environment transitions to the next state s_{t+1} ∼ P(·|s_t, a_t) and emits a scalar reward signal r_{t+1} = R(s_t, a_t, s_{t+1}). A policy is a mapping π : S × A → [0, 1] that determines the probability of choosing an action at a given state.

Algorithm. The agent's objective is to find an optimal policy. A popular algorithm is Q-learning (Watkins & Dayan, 1992), where parameterized action-values Q_θ are updated using θ ← θ + α δ_t ∇_θ Q_θ(s_t, a_t) for step-size α > 0, with TD-error δ_t := r_{t+1} + γ max_{a'∈A} Q_θ(s_{t+1}, a') − Q_θ(s_t, a_t). The policy is defined by acting greedily w.r.t. these action-values. ER is critical when using neural networks to estimate Q_θ, as in DQN (Mnih et al., 2015), both to stabilize and to speed up learning. MBRL has the potential to provide even further sample-efficiency improvements. We build on the Dyna formalism (Sutton, 1991) for MBRL, and more specifically on the recently proposed HC-Dyna (Pan et al., 2019), shown in Algorithm 1. HC-Dyna provides a particular approach to search-control, the mechanism of generating states or state-action pairs from which to query the model to get next states and rewards (i.e., hypothetical experiences). It is characterized by the fact that it generates states by hill climbing on some criterion function h(·). The term Hill Climbing (HC) is used for generality, as the vanilla gradient ascent procedure is modified to resolve certain challenges (Pan et al., 2019). Two particular choices have been proposed for h(·): the value function v(s) from Pan et al. (2019) and the gradient magnitude ||∇_s v(s)|| from Pan et al. (2020). The former is used as a measure of the utility of visiting a state, and the latter is considered a measure of value approximation difficulty.

A hypothetical experience is obtained by first selecting a state s, then typically selecting the action a according to the current policy, and then querying the model to get the next state s' and reward r. These hypothetical transitions are treated just like real transitions. For this reason, HC-Dyna combines both real experience and hypothetical experience into mini-batch updates. These n updates, performed before taking the next action, are called planning updates, as they improve the action-value estimates, and hence the policy, using a model. However, the two previous works have several limitations. First, the HC method proposed by Pan et al. (2019) is mostly supported by intuition, without any theoretical justification for using stochastic gradient ascent trajectories for search-control. Second, HC on the gradient norm and Hessian norm of the learned value function (Pan et al., 2020) is supported by some suggestive theoretical evidence, but it suffers from high computational cost and zero gradients due to the high-order differentiation (i.e., ∇_s ||∇_s v(s)||), as noted by the authors. This paper will introduce our novel HC search-control method, motivated by overcoming these limitations.
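The hill-climbing search-control step can be sketched as follows: starting from a previously visited state, take noisy gradient-ascent steps on a differentiable criterion h(s), and collect the visited points into a search-control queue from which hypothetical experiences are later generated. This is a schematic of the general HC-Dyna loop, not any paper's exact procedure: the quadratic criterion, the step size `eta`, and the Gaussian `noise` term are illustrative choices, with the noise standing in for the stochastic modifications of vanilla gradient ascent mentioned above.

```python
import numpy as np

def hill_climb_search_control(s0, grad_h, n_steps=20, eta=0.1,
                              noise=0.01, seed=0):
    """Generate search-control states by noisy gradient ascent on h."""
    rng = np.random.default_rng(seed)
    s, queue = np.asarray(s0, dtype=float), []
    for _ in range(n_steps):
        # ascend the criterion h, perturbed to encourage exploration
        s = s + eta * grad_h(s) + noise * rng.normal(size=s.shape)
        queue.append(s.copy())
    return queue

# Illustrative criterion h(s) = -||s - s_star||^2, whose gradient
# ascent moves states toward the (hypothetical) high-criterion point s_star.
s_star = np.array([1.0, -2.0])
grad_h = lambda s: -2.0 * (s - s_star)

queue = hill_climb_search_control(s0=[0.0, 0.0], grad_h=grad_h)
# Planning then pairs each queued s with a ~ pi(.|s), queries the model
# for (s', r), and trains on the hypothetical transition (s, a, r, s').
```

In practice h would be the learned value function (or its gradient magnitude) and its gradient would come from automatic differentiation rather than a closed form.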

(Algorithm 1 fragment: sample b/2 experiences from B_er, add them to the mini-batch B, and update the policy on the mixed mini-batch B.)

