EXPERIENCE REPLAY WITH LIKELIHOOD-FREE IMPORTANCE WEIGHTS

Abstract

The use of past experiences to accelerate temporal difference (TD) learning of value functions, or experience replay, is a key component in deep reinforcement learning. In this work, we propose to reweight experiences based on their likelihood under the stationary distribution of the current policy, and justify this choice with a contraction argument over the Bellman evaluation operator. The resulting TD objective encourages small approximation errors on the value function over frequently encountered states. To balance bias and variance in practice, we train a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the estimated ratios as prioritization weights. We apply the proposed approach empirically to three competitive methods, Soft Actor-Critic (SAC), Twin Delayed Deep Deterministic policy gradient (TD3), and Data-regularized Q (DrQ), over 11 tasks from OpenAI gym and the DeepMind control suite. We achieve superior sample complexity on 35 out of 45 method-task combinations compared to the best baseline, and similar sample complexity on the remaining 10.

1. INTRODUCTION

Deep reinforcement learning methods have achieved much success in a wide variety of domains (Mnih et al., 2016; Lillicrap et al., 2015; Horgan et al., 2018). While on-policy methods (Schulman et al., 2017) are effective, using off-policy data often yields better sample efficiency (Haarnoja et al., 2018; Fujimoto et al., 2018), which is critical when querying the environment is expensive and experiences are difficult to obtain. Experience replay (Lin, 1992) is a popular paradigm in off-policy reinforcement learning, where experiences stored in a replay memory can be reused to perform additional updates. When applied to temporal difference (TD) learning of the Q-value function (Mnih et al., 2015), the use of replay buffers avoids catastrophic forgetting of previous experiences and improves learning. Selecting experiences from the replay buffer with a prioritization strategy (instead of uniformly) can lead to large empirical improvements in sample efficiency (Hessel et al., 2017). Existing prioritization procedures rely on particular choices of importance sampling; for instance, Prioritized Experience Replay (PER) samples experiences with high TD error more often, and then down-weights frequently sampled experiences to remain close to uniform sampling over the buffer (Schaul et al., 2015). However, this strategy might not work well in actor-critic methods, where the goal is to learn the value function (or Q-value function) induced by the current policy, and following off-policy experiences might be harmful. In this case, it might be more beneficial to perform importance sampling that reflects on-policy experiences instead. Based on this intuition, we investigate a new prioritization strategy for actor-critic methods based on the likelihood (i.e., the frequency) of experiences under the stationary distribution of the current policy (Tsitsiklis et al., 1997).
In actor-critic methods (Konda & Tsitsiklis, 2000), we can estimate the value function of a policy by minimizing the expected squared difference between the critic network and its target value over a replay buffer; an appropriate replay buffer should properly reflect the discrepancy between critic value functions. We treat a discrepancy as "proper" if it preserves the contraction properties of the Bellman operator, and consider discrepancies measured by expected squared distances under some state-action distribution. In Theorem 1 we prove that the stationary distribution of the current policy is the only distribution under which the Bellman operator is a contraction (i.e., is "proper"); this motivates using the stationary distribution as the underlying distribution for the replay buffer. Intuitively, optimizing the expected TD error under the stationary distribution addresses the TD-learning issue in actor-critic methods, since TD errors in frequently visited states are given more weight. To use replay buffers derived from the stationary distribution with existing deep reinforcement learning methods, we must be mindful of the following bias-variance trade-off: we have few experiences from the current policy (relying only on these yields high variance), but many experiences from other policies in the same environment (relying on these introduces bias). We propose to find an appropriate bias-variance trade-off by using importance sampling over the replay buffer, which requires an estimate of the density ratio between the stationary distribution of the current policy and the replay buffer distribution. Inspired by recent advances in inverse reinforcement learning (Fu et al., 2017) and off-policy policy evaluation (Grover et al., 2019), we use a likelihood-free method that obtains this density ratio from a classifier trained to distinguish the two types of experiences.
We consider a smaller, "fast" replay buffer that contains near-on-policy experiences and a larger, "slow" replay buffer that contains additional off-policy experiences, and estimate density ratios between the two buffers. We then use these estimated density ratios as importance weights in the Q-value function update objective. This encourages more updates on state-action pairs that are more likely under the stationary distribution of the current policy, i.e., closer to the fast replay buffer. Our approach can be readily combined with existing approaches that learn value functions from replay buffers. We apply it to three competitive actor-critic methods: Soft Actor-Critic (SAC; Haarnoja et al., 2018), Twin Delayed Deep Deterministic policy gradient (TD3; Fujimoto et al., 2018), and Data-regularized Q (DrQ; Kostrikov et al., 2020). We demonstrate the effectiveness of our approach on 11 environments from OpenAI gym (Dhariwal et al., 2017) and the DeepMind Control Suite (Tassa et al., 2018), covering both low-dimensional state spaces and high-dimensional image observations; this results in 45 method-task combinations in total. Notably, our approach outperforms the respective baselines in 35 of the 45 cases, while being competitive in the remaining 10. This demonstrates that our method can be applied as a simple plug-and-play approach to improve existing actor-critic methods.
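The classifier-based density ratio estimation described above can be sketched with a simple logistic classifier: train it to distinguish fast-buffer samples (label 1) from slow-buffer samples (label 0), and read the density ratio off its logits, since p_fast(x)/p_slow(x) = D(x)/(1 - D(x)) = exp(logit(x)). A minimal NumPy sketch, where feature vectors stand in for (s, a) pairs; in practice a neural classifier would be used, and all names here are hypothetical:

```python
import numpy as np

def train_ratio_classifier(fast, slow, lr=0.5, iters=500):
    """Logistic classifier distinguishing fast-buffer samples (label 1)
    from slow-buffer samples (label 0); its logit estimates
    log p_fast(x) - log p_slow(x)."""
    X = np.vstack([fast, slow])
    y = np.concatenate([np.ones(len(fast)), np.zeros(len(slow))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # P(fast | x)
        g = p - y                               # gradient of the log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def importance_weights(samples, w, b):
    # density ratio p_fast / p_slow = D / (1 - D) = exp(logit)
    return np.exp(samples @ w + b)
```

The resulting (always positive) weights can then be plugged in as per-transition importance weights when updating the Q-network.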

2. PRELIMINARIES

The reinforcement learning problem can be described as finding a policy for a Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, p_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ is the transition kernel, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\gamma \in [0, 1)$ is the discount factor, and $p_0 \in \mathcal{P}(\mathcal{S})$ is the initial state distribution. The goal is to learn a stationary policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ that selects actions in $\mathcal{A}$ at each state $s \in \mathcal{S}$ so as to maximize the expected sum of discounted rewards: $J(\pi) := \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)]$, where the expectation is over trajectories sampled from $s_0 \sim p_0$, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim P(\cdot|s_t, a_t)$ for $t \geq 0$. For a fixed policy, the MDP becomes a Markov chain, so we define the state-action distribution at timestep $t$, $d_t^\pi(s, a)$, and the corresponding (unnormalized) stationary distribution over states and actions $d^\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t d_t^\pi(s, a)$ (we assume this always exists for the policies we consider). We can then write $J(\pi) = \mathbb{E}_{d^\pi}[r(s, a)]$. For any stationary policy $\pi$, we define its corresponding state-action value function as $Q^\pi(s, a) := \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a]$, its corresponding value function as $V^\pi(s) := \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^\pi(s, a)]$, and the advantage function as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. A large variety of actor-critic methods (Konda & Tsitsiklis, 2000) have been developed in the context of deep reinforcement learning (Silver et al., 2014; Mnih et al., 2016; Lillicrap et al., 2015; Haarnoja et al., 2018; Fujimoto et al., 2018), where learning good approximations to the Q-function is critical to the success of any deep reinforcement learning method based on the actor-critic paradigm. The Q-function can be learned via temporal difference (TD) learning (Sutton, 1988).
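As an illustration of the definitions above, the (unnormalized) discounted occupancy $d^\pi$ admits a closed form for a tabular Markov chain, from which $J(\pi) = \mathbb{E}_{d^\pi}[r]$ follows directly. A minimal sketch over states only (the state-action case is analogous; all names here are hypothetical):

```python
import numpy as np

def discounted_occupancy(P_pi, p0, gamma):
    """Unnormalized discounted occupancy d = sum_t gamma^t d_t, where
    d_t = p0 @ P_pi^t. Closed form: d = (I - gamma * P_pi^T)^{-1} p0."""
    n = len(p0)
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, p0)

def policy_return(P_pi, p0, gamma, r):
    # J(pi) = E_{d^pi}[r] for a (state-dependent) reward vector r
    return discounted_occupancy(P_pi, p0, gamma) @ r
```

Since each $d_t$ is a probability distribution, the occupancy sums to $1/(1-\gamma)$, which is why $d^\pi$ is unnormalized.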



TD learning is based on the Bellman equation $Q^\pi(s, a) = \mathcal{B}^\pi Q^\pi(s, a)$, where $\mathcal{B}^\pi$ denotes the Bellman evaluation operator

$$\mathcal{B}^\pi Q(s, a) := r(s, a) + \gamma \, \mathbb{E}_{s', a'}[Q(s', a')], \qquad (1)$$

where in the expectation we sample the next step, $s' \sim P(\cdot|s, a)$ and $a' \sim \pi(\cdot|s')$. Given some experience replay buffer $\mathcal{D}$ (collected by navigating the same environment, but with unknown and potentially different policies), one could optimize the following loss for a Q-network:

$$L_Q(\theta; \mathcal{D}) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[(Q_\theta(s, a) - \hat{\mathcal{B}}^\pi Q_\theta(s, a))^2\big], \qquad (2)$$

where $\hat{\mathcal{B}}^\pi$ denotes a sample-based estimate of $\mathcal{B}^\pi$.
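Our method replaces the uniform expectation over the buffer in this loss with per-transition importance weights given by the estimated density ratios. A tabular sketch of the resulting weighted squared-TD objective, using array-indexed Q-tables (all names here are hypothetical; the paper applies this with neural Q-networks):

```python
import numpy as np

def weighted_td_loss(q, q_target, s, a, r, s2, a2, w, gamma):
    """Importance-weighted squared TD error over a batch of transitions
    (s, a, r, s2) with next actions a2 ~ pi(.|s2) and weights w."""
    target = r + gamma * q_target[s2, a2]  # one-sample Bellman backup
    td = q[s, a] - target                  # TD residual per transition
    return np.mean(w * td ** 2)
```

With all weights equal to one this reduces to the unweighted loss above; larger weights direct more of the Q-update toward transitions that are likely under the current policy.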

