EXPERIENCE REPLAY WITH LIKELIHOOD-FREE IMPORTANCE WEIGHTS

Abstract

The use of past experiences to accelerate temporal difference (TD) learning of value functions, or experience replay, is a key component in deep reinforcement learning. In this work, we propose to reweight experiences based on their likelihood under the stationary distribution of the current policy, and justify this with a contraction argument over the Bellman evaluation operator. The resulting TD objective encourages small approximation errors on the value function over frequently encountered states. To balance bias and variance in practice, we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the ratios as the prioritization weights. We apply the proposed approach empirically on three competitive methods, Soft Actor Critic (SAC), Twin Delayed Deep Deterministic policy gradient (TD3) and Data-regularized Q (DrQ), over 11 tasks from OpenAI gym and DeepMind control suite. We achieve superior sample complexity on 35 out of 45 method-task combinations compared to the best baseline and similar sample complexity on the remaining 10.

1. INTRODUCTION

Deep reinforcement learning methods have achieved much success in a wide variety of domains (Mnih et al., 2016; Lillicrap et al., 2015; Horgan et al., 2018). While on-policy methods (Schulman et al., 2017) are effective, using off-policy data often yields better sample efficiency (Haarnoja et al., 2018; Fujimoto et al., 2018), which is critical when querying the environment is expensive and experiences are difficult to obtain. Experience replay (Lin, 1992) is a popular paradigm in off-policy reinforcement learning, where experiences stored in a replay memory can be reused to perform additional updates. When applied to temporal difference (TD) learning of the Q-value function (Mnih et al., 2015), the use of replay buffers avoids catastrophic forgetting of previous experiences and improves learning. Selecting experiences from the replay buffer with a prioritization strategy (instead of uniformly) can lead to large empirical improvements in sample efficiency (Hessel et al., 2017). Existing prioritization procedures rely on particular choices of importance sampling; for instance, Prioritized Experience Replay (PER) selects experiences with high TD error more often, and then down-weights the frequently sampled experiences in order to remain closer to uniform sampling over the experiences (Schaul et al., 2015). However, this might not work well in actor-critic methods, where the goal is to learn the value function (or Q-value function) induced by the current policy, and following off-policy experiences might be harmful. In this case, it may be more beneficial to perform importance sampling that reflects on-policy experiences instead. Based on this intuition, we investigate a new prioritization strategy for actor-critic methods based on the likelihood (i.e., the frequency) of experiences under the stationary distribution of the current policy (Tsitsiklis et al., 1997).
In actor-critic methods (Konda & Tsitsiklis, 2000), we can estimate the value function of a policy by minimizing the expected squared difference between the critic network and its target value over a replay buffer; an appropriate replay buffer should properly reflect the discrepancy between critic value functions. We treat a discrepancy as "proper" if it preserves the contraction properties of the Bellman operator, and consider discrepancies measured by expected squared distances under some state-action distribution. In Theorem 1 we prove that the stationary distribution of the current policy is the only distribution under which the Bellman operator is a contraction (i.e., is "proper"); this motivates using the stationary distribution as the underlying distribution of the replay buffer. Intuitively, optimizing the expected TD error under the stationary distribution addresses the TD-learning issue in actor-critic methods, as the TD errors in high-frequency states are given more weight. To use replay buffers derived from the stationary distribution with existing deep reinforcement learning methods, we need to be mindful of the following bias-variance trade-off. We have fewer experiences from the current policy (using only these results in high variance), but more experiences from other policies in the same environment (using these results in high bias). We propose to find appropriate bias-variance trade-offs by performing importance sampling over the replay buffer, which requires an estimate of the density ratio between the stationary distribution of the policy and the replay buffer distribution. Inspired by recent advances in inverse reinforcement learning (Fu et al., 2017) and off-policy policy evaluation (Grover et al., 2019), we use a likelihood-free method to obtain an estimate of the density ratio from a classifier trained to distinguish different types of experiences.
We consider a smaller, "fast" replay buffer that contains near on-policy experiences, and a larger, "slow" replay buffer that contains additional off-policy experiences, and estimate density ratios between the two buffers. We then use these estimated density ratios as importance weights on the Q-value function update objective. This encourages more updates on state-action pairs that are more likely under the stationary distribution of the current policy, i.e., closer to the fast replay buffer. Our approach can be readily combined with existing approaches that learn value functions from replay buffers. We apply our approach to three competitive actor-critic methods, Soft Actor-Critic (SAC, Haarnoja et al. (2018)), Twin Delayed Deep Deterministic policy gradient (TD3, Fujimoto et al. (2018)) and Data-regularized Q (DrQ, Kostrikov et al. (2020)). We demonstrate the effectiveness of our approach on 11 environments from OpenAI gym (Dhariwal et al., 2017) and the DeepMind Control Suite (Tassa et al., 2018), where both low-dimensional state representations and high-dimensional image representations are considered; this results in 45 method-task combinations in total. Notably, our approach outperforms the respective baselines in 35 out of the 45 cases, while being competitive in the remaining 10 cases. This demonstrates that our method can be applied as a simple plug-and-play approach to improve existing actor-critic methods.

2. PRELIMINARIES

The reinforcement learning problem can be described as finding a policy for a Markov decision process (MDP) defined by the tuple (S, A, P, r, γ, p₀), where S is the state space, A is the action space, P : S × A → P(S) is the transition kernel, r : S × A → R is the reward function, γ ∈ [0, 1) is the discount factor and p₀ ∈ P(S) is the initial state distribution. The goal is to learn a stationary policy π : S → P(A) that selects actions in A for each state s ∈ S, such that the policy maximizes the expected sum of rewards J(π) := E_π[Σ_{t=0}^∞ γ^t r(s_t, a_t)], where the expectation is over trajectories sampled from s₀ ∼ p₀, a_t ∼ π(·|s_t), and s_{t+1} ∼ P(·|s_t, a_t) for t ≥ 0. For a fixed policy, the MDP becomes a Markov chain, so we define the state-action distribution at timestep t as d^π_t(s, a), and the corresponding (unnormalized) stationary distribution over states and actions as d^π(s, a) = Σ_{t=0}^∞ γ^t d^π_t(s, a) (we assume this always exists for the policies we consider). We can then write J(π) = E_{d^π}[r(s, a)]. For any stationary policy π, we define its corresponding state-action value function as Q^π(s, a) := E_π[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s₀ = s, a₀ = a], its corresponding value function as V^π(s) := E_{a∼π(·|s)}[Q^π(s, a)], and the advantage function as A^π(s, a) = Q^π(s, a) − V^π(s). A large variety of actor-critic methods (Konda & Tsitsiklis, 2000) have been developed in the context of deep reinforcement learning (Silver et al., 2014; Mnih et al., 2016; Lillicrap et al., 2015; Haarnoja et al., 2018; Fujimoto et al., 2018), where learning good approximations to the Q-function is critical to the success of any deep reinforcement learning method based on the actor-critic paradigm.
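For a small finite MDP, the discounted occupancy d^π defined above can be computed in closed form, since the discounted sum of state distributions solves a linear system. The following numpy sketch illustrates this (the function name, tensor layout, and toy environment are our own, not from the paper):

```python
import numpy as np

def discounted_occupancy(P, pi, p0, gamma):
    """Unnormalized discounted state-action occupancy d^pi(s, a).

    P:  transition kernel, shape (S, A, S), with P[s, a, t] = P(s'=t | s, a)
    pi: policy, shape (S, A), rows sum to 1
    p0: initial state distribution, shape (S,)
    """
    S, A, _ = P.shape
    # State-to-state kernel under pi: P_pi[s, t] = sum_a pi(a|s) P(t|s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)
    # rho = sum_t gamma^t p_t satisfies rho = p0 + gamma * P_pi^T rho
    rho = np.linalg.solve(np.eye(S) - gamma * P_pi.T, p0)
    return rho[:, None] * pi  # d^pi(s, a) = rho(s) pi(a|s)

# Two-state, two-action toy example
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0  # action 0 always leads to state 0
P[:, 1, 1] = 1.0  # action 1 always leads to state 1
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
p0 = np.array([1.0, 0.0])
d = discounted_occupancy(P, pi, p0, gamma=0.9)
# The unnormalized occupancy sums to the geometric series 1/(1 - gamma)
print(d.sum())  # -> 10.0 (up to floating point)
```

Dividing by 1/(1 − γ) gives a proper probability distribution, and J(π) = E_{d^π}[r(s, a)] becomes a simple weighted sum over state-action pairs.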
The Q-function can be learned via temporal difference (TD) learning (Sutton, 1988) based on the Bellman equation Q^π(s, a) = B^π Q^π(s, a), where B^π denotes the Bellman evaluation operator B^π Q(s, a) := r(s, a) + γ E_{s′,a′}[Q(s′, a′)], with the next step sampled as s′ ∼ P(·|s, a) and a′ ∼ π(·|s′). Given some experience replay buffer D (collected by navigating the same environment, but with unknown and potentially different policies), one could optimize the following loss for a Q-network: L_Q(θ; D) = E_{(s,a)∼D}[(Q_θ(s, a) − B̂^π Q_θ(s, a))²] (2), which fits Q_θ(s, a) to an estimate of the target value B̂^π Q_θ(s, a).[1] In practice, the target values can be estimated either via on-policy experiences (Sutton et al., 1999) or via off-policy experiences (Precup, 2000; Munos et al., 2016). Ideally, we could learn Q^π by optimizing L_Q(θ; D) to zero with over-parametrized neural networks. However, instead of minimizing L_Q(θ; D) directly, prioritization over the sampled replay buffer D can lead to stronger performance. For example, prioritized experience replay (PER, Schaul et al. (2015)) is a heuristic that assigns higher weights to transitions with higher TD errors, and has been applied successfully in deep Q-learning (Hessel et al., 2017).
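As a concrete tabular illustration of the loss in Eq. 2 (our own sketch, not the paper's implementation), the following computes the mean squared TD error over a replay batch, treating the target as a constant as in semi-gradient TD:

```python
import numpy as np

def td_loss(Q, batch, gamma):
    """Mean squared TD error over a replay batch (hypothetical tabular sketch).

    Q: array of shape (S, A); batch: list of (s, a, r, s_next, a_next), where
    a_next is drawn from the current policy, giving an estimate of B^pi.
    """
    errors = []
    for s, a, r, s_next, a_next in batch:
        target = r + gamma * Q[s_next, a_next]  # treated as constant (no gradient)
        errors.append((Q[s, a] - target) ** 2)
    return np.mean(errors)

Q = np.zeros((3, 2))
batch = [(0, 0, 1.0, 1, 0), (1, 0, 0.0, 2, 1)]
print(td_loss(Q, batch, gamma=0.99))  # -> 0.5
```

With function approximation, Q[s, a] becomes a network evaluation Q_θ(s, a) and the same squared error is minimized by gradient descent, with the target held fixed.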

3. PRIORITIZED EXPERIENCE REPLAY BASED ON STATIONARY DISTRIBUTIONS

Assume that d, the distribution from which the replay buffer D is sampled, is supported on the entire space S × A, and that we have infinite samples from π (so the Bellman target is unbiased). Let us define the TD-learning objective for Q with prioritization weights w : S × A → R₊, under the sampling distribution d ∈ P(S × A): L_Q(θ; d, w) = E_d[w(s, a)(Q_θ(s, a) − B^π Q_θ(s, a))²]. In practice, the expectation in L_Q(θ; d, w) can be estimated with Monte Carlo methods, such as importance sampling, rejection sampling, or combinations of multiple methods (as in PER (Schaul et al., 2015)). Without loss of generality, we can treat the problem as optimizing the mean squared TD error under some priority distribution d_w ∝ d · w, since arg min_θ L_Q(θ; d, w) = arg min_θ L_Q(θ; d_w); one can therefore view prioritized experience replay for TD learning as selecting a favorable priority distribution d_w (under which the L_Q loss is computed) in order to improve some notion of performance. In this paper, we propose to use the priority distribution d_w = d^π, where d^π is the stationary distribution of state-action pairs under the current policy π. This reflects the intuition that TD errors in high-frequency state-action pairs are more problematic than in low-frequency ones, as they will negatively impact policy updates more severely. In the following subsection, we argue for the importance of choosing d^π from the perspective of maintaining desirable contraction properties of the Bellman operators under more general norms. If we consider Euclidean norms weighted under some distribution d_w ∈ P(S × A), the usual γ-contraction argument for Bellman operators holds only for d_w = d^π, and not for other distributions.

3.1. POLICY-DEPENDENT NORMS FOR BELLMAN BACKUP

The convergence of Bellman updates relies on the fact that the Bellman evaluation operator B^π is a γ-contraction with respect to the ℓ∞ norm, i.e., ∀Q, Q′ ∈ Q, where Q = {Q : (S × A) → R} is the set of all possible Q-functions: ‖B^π Q − B^π Q′‖_∞ ≤ γ ‖Q − Q′‖_∞ (5). While it is sufficient for showing convergence results, the ℓ∞ norm reflects the distance between two Q-functions at the worst possible state-action pair, and is independent of the current policy. If two Q-functions are equal everywhere except for a large difference on a single state-action pair (s̄, ā) that is unlikely under d^π, the ℓ∞ distance between the two Q-functions is large. In practice, however, this has little effect on policy updates, as it is unlikely for the current policy to sample (s̄, ā). Since our goal with the TD updates is to learn Q^π, a distance metric related to π is more suitable for comparing Q-functions, reflecting the intuition that errors in frequent state-action pairs are more costly than in infrequent ones. Let us consider the following weighted ℓ2 distance between Q-functions: ‖Q − Q′‖²_d := E_{(s,a)∼d}[(Q(s, a) − Q′(s, a))²] (6), where d ∈ P(S × A) is a distribution over state-action pairs. This can be treated as the ℓ2 norm measured over the distribution d as opposed to the Lebesgue measure. It is closely tied to the L_Q objective since L_Q(θ; d) = ‖Q_θ − B^π Q_θ‖²_d. In the following statements, we show that B^π is a contraction operator only under the ‖·‖_{d^π} norm; this supports the use of d^π instead of other distributions for the L_Q objective, as it reflects a more reasonable measurement of distance between Q-functions for policy π.

Lemma 1. For all γ ∈ (0, 1), the Bellman operator B^π is a γ-contraction with respect to the ‖·‖_d norm if d = d^π holds almost everywhere, i.e., d = d^π a.e. ⟹ ‖B^π Q − B^π Q′‖_d ≤ γ ‖Q − Q′‖_d, ∀Q, Q′ ∈ Q.

Proof. In Appendix B. On a high level, we apply Jensen's inequality to f(x) = x².

Theorem 1.
For all γ ∈ (0, 1), the Bellman operator B^π is a γ-contraction with respect to the ‖·‖_d norm if and only if d = d^π holds almost everywhere, i.e., d = d^π a.e. ⟺ ‖B^π Q − B^π Q′‖_d ≤ γ ‖Q − Q′‖_d, ∀Q, Q′ ∈ Q.

Proof. In Appendix B. On a high level, whenever d = d^π fails to hold over some non-empty open set, we can perturb a constant Q-value function over this set to contradict the γ-contraction.

Theorem 1 highlights the importance of using d^π in the ‖·‖_d norm for measuring the distance between Q-functions: if we use any distribution other than d^π, the Bellman operator is not guaranteed to be a γ-contraction under that distance, which can lead to worse convergence rates.
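Both directions of Theorem 1 can be checked numerically on a tiny chain. The sketch below is our own construction (not from the paper): it verifies the γ-contraction under the stationary distribution for random Q-function differences, and then exhibits a non-stationary distribution and a Q-function pair that violate it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state chain: every action from s0 leads to s1, and vice versa
P = np.zeros((2, 2, 2))
P[0, :, 1] = 1.0
P[1, :, 0] = 1.0
pi = np.full((2, 2), 0.5)
gamma = 0.9

def bellman_diff(Delta):
    # (B^pi Q - B^pi Q')(s, a) = gamma * E_{s',a'}[Delta(s', a')]; rewards cancel
    v_next = (pi * Delta).sum(axis=1)          # E_{a'}[Delta(s', a')] per state s'
    return gamma * np.einsum("sat,t->sa", P, v_next)

def wnorm(Delta, d):
    return np.sqrt((d * Delta ** 2).sum())

# Stationary state distribution of the induced chain (uniform by symmetry)
rho = np.array([0.5, 0.5])
d_pi = rho[:, None] * pi

# Under d^pi, the gamma-contraction holds for random Q - Q'
for _ in range(100):
    Delta = rng.normal(size=(2, 2))
    assert wnorm(bellman_diff(Delta), d_pi) <= gamma * wnorm(Delta, d_pi) + 1e-12

# Under a different d it fails: all mass on s0, difference only on s1
d_bad = np.array([[0.5, 0.5], [0.0, 0.0]])
Delta = np.array([[0.0, 0.0], [1.0, 1.0]])
print(wnorm(bellman_diff(Delta), d_bad))   # -> 0.9
print(gamma * wnorm(Delta, d_bad))         # -> 0.0, so contraction is violated
```

The failure case mirrors the proof idea: d_bad places weight where the Q-functions agree, while the next-step distribution carries the disagreement.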

3.2. TD LEARNING BASED ON d π

The ‖·‖_{d^π} norm also captures our intuition that errors in high-frequency state-action pairs are more problematic than in low-frequency ones, as they are likely to have a larger effect on policy learning. For example, consider the actor-critic policy gradient with Q_θ: ∇_φ J(π_φ) = E_{d^π}[∇_φ log π_φ(a|s) Q_θ(s, a)] (7); if (Q_θ(s, a) − Q^π(s, a))² is large for high-frequency (s, a) tuples, then the policy update is likely to be worse than the update with the ground-truth Q^π. Moreover, the gradient descent update over the objective L_Q(θ; d^π), θ ← θ − η ∇_θ L_Q(θ; d^π) = θ − η E_{d^π}[(Q_θ(s, a) − B̂^π Q_θ(s, a)) ∇_θ Q_θ(s, a)], corresponds to a batch version of the TD update. This places more emphasis on TD errors for state-action pairs that occur more frequently under the current policy. To illustrate the validity of using d^π, we consider a chain MDP example (Figure 1a, but with 5 states in total), where the agent takes one of two actions to progress to the state on the left or on the right. The agent receives a final reward of 1 at the right-most state and rewards of 0 at other states. The policy takes the right action at each state with probability p, and the left action with probability 1 − p. We initialize the Q-function from [0, 1] uniformly at random and consider p = 0.8 and p = 0.2. We compare three approaches to prioritization with TD updates: uniform over all state-action pairs, prioritization by TD error (as done in Schaul et al. (2015)), and prioritization with d^π; we include more details in Appendix C. We plot the ‖·‖²_{d^π} distance between the learned Q-function and the ground-truth Q-function in Figure 1b; prioritization with d^π outperforms both uniform sampling and prioritization by TD error in terms of speed of convergence to the ground truth, especially in the initial iterations.
When p = 0.8, prioritization with d^π takes only 120 steps on average to decrease the expected error below 1, while TD-error prioritization takes 182 steps on average; this means that prioritization with d^π is helpful when we have a limited update budget.
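The chain-MDP simulation can be reproduced in a few lines of numpy. The sketch below is our own approximation of the setup (the restart-on-reward dynamics are an assumption on our part; the paper's exact protocol is described in its Appendix C): it computes the ground-truth Q^π by a linear solve, the stationary distribution d^π, and then runs the weighted tabular TD sweeps of Eq. 16 with d^π prioritization.

```python
import numpy as np

# Toy 5-state chain: action 1 moves right, action 0 moves left; reaching the
# right-most state yields reward 1 and (assumed here) restarts at state 0.
S, gamma, p = 5, 0.9, 0.8
pi = np.array([1 - p, p])                      # (P(left), P(right))

def step(s, a):
    s2 = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == S - 1 else 0.0
    if s2 == S - 1:
        s2 = 0                                 # restart (assumed dynamics)
    return s2, r

# Ground-truth Q^pi from the linear Bellman equations Q = r + gamma * P_pi Q
A_mat, b = np.eye(S * 2), np.zeros(S * 2)
for s in range(S):
    for a in range(2):
        s2, r = step(s, a)
        b[s * 2 + a] = r
        for a2 in range(2):
            A_mat[s * 2 + a, s2 * 2 + a2] -= gamma * pi[a2]
Q_true = np.linalg.solve(A_mat, b).reshape(S, 2)

# Stationary state distribution of the induced chain, then d^pi(s, a)
P_pi = np.zeros((S, S))
for s in range(S):
    for a in range(2):
        s2, _ = step(s, a)
        P_pi[s, s2] += pi[a]
evals, evecs = np.linalg.eig(P_pi.T)
rho = np.real(evecs[:, np.argmax(np.real(evals))])
rho = rho / rho.sum()
d_pi = rho[:, None] * pi[None, :]

def bellman(Q):
    BQ = np.zeros_like(Q)
    for s in range(S):
        for a in range(2):
            s2, r = step(s, a)
            BQ[s, a] = r + gamma * (pi * Q[s2]).sum()
    return BQ

# Weighted tabular TD sweeps (Eq. 16 in the paper's Appendix C)
rng = np.random.default_rng(0)
Q = rng.uniform(size=(S, 2))
w = d_pi / d_pi.mean()                         # mean-1 prioritization weights
eta = 0.5
for _ in range(1000):
    Q = Q + (1 - (1 - eta) ** w) * (bellman(Q) - Q)
err = (d_pi * (Q - Q_true) ** 2).sum()
print(err)                                     # small after the sweeps
```

Swapping w for uniform weights (all ones) or normalized TD errors reproduces the other two prioritization schemes compared in Figure 1b.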

4. LIKELIHOOD-FREE IMPORTANCE WEIGHTING OVER REPLAY BUFFERS

In practice, however, there are two challenges with using L_Q(θ; d^π) as the objective. On the one hand, an accurate estimate of d^π requires many on-policy samples from d^π and interactions with the environment, which could increase the practical sample complexity; on the other hand, if we instead use off-policy experiences from the replay buffer, it is difficult to estimate the importance ratio w(s, a) := d^π(s, a) / d^D(s, a) when the replay buffer D is a mixture of trajectories from different policies. Therefore, likelihood-free density ratio estimation methods, which rely only on samples (e.g., from the replay buffer) rather than likelihoods, are more general and well-suited for estimating the objective L_Q(θ; d^π) with a good bias-variance trade-off. An appropriate choice of importance weights allows us to balance bias (which comes from replay experiences of other policies) and variance (which comes from a small number of on-policy experiences). In this paper, we use the variational representation of f-divergences (Csiszar, 1964) to estimate the density ratios. For any convex, lower-semicontinuous function f : [0, ∞) → R satisfying f(1) = 0, the f-divergence between two probability measures P, Q ∈ P(X) (where we assume P ≪ Q, i.e., P is absolutely continuous w.r.t. Q) is defined as: D_f(P ‖ Q) = ∫_X f(dP(x)/dQ(x)) dQ(x). A general variational method can be used to estimate f-divergences given only samples from P and Q.

Lemma 2 (Nguyen et al. (2008)). Assume that f is differentiable on [0, +∞) with derivative f′. For all P, Q ∈ P(X) such that P ≪ Q and all w : X → R₊, D_f(P ‖ Q) ≥ E_P[f′(w(x))] − E_Q[f⋆(f′(w(x)))], where f⋆ denotes the convex conjugate, and equality is achieved when w = dP/dQ.

We can apply this approach to estimate the density ratio w(s, a) := d^π(s, a) / d^D(s, a) with samples from the replay buffer. These ratios are then multiplied into the Q-function updates to perform importance weighting.
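For the Jensen-Shannon divergence (the choice used in this paper), the variational objective reduces to binary classification: a classifier D(x) trained to separate samples of p (label 1) from samples of q (label 0) yields the ratio estimate p(x)/q(x) ≈ D(x)/(1 − D(x)). Below is a self-contained 1-D sketch with synthetic Gaussians standing in for "fast" and "slow" experiences (all names and data are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x_fast = rng.normal(1.0, 1.0, n)   # stands in for near on-policy samples
x_slow = rng.normal(0.0, 1.0, n)   # stands in for replay-buffer samples

X = np.concatenate([x_fast, x_slow])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic regression by gradient descent; the logit a*x + b can represent
# the exact log-ratio here, which is linear: log p/q = x - 0.5
a, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    probs = 1.0 / (1.0 + np.exp(-(a * X + b)))
    grad = probs - y                  # gradient of mean cross-entropy loss
    a -= lr * np.mean(grad * X)
    b -= lr * np.mean(grad)

def ratio(x):
    d = 1.0 / (1.0 + np.exp(-(a * x + b)))
    return d / (1.0 - d)              # = exp(a*x + b)

# True ratio N(1,1)/N(0,1) at x = 1 is exp(0.5) ≈ 1.649
print(ratio(1.0))
```

In the paper's setting, x is a state-action pair, the classifier is the network w_ψ, and the two sample sets come from the fast and slow replay buffers.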
Specifically, we consider sampling from two types of replay buffers. One is the regular (slow) replay buffer, which contains a mixture of trajectories from different policies; the other is a smaller (fast) replay buffer, which contains only a small set of trajectories from very recent policies. After each episode of environment interaction, we update both replay buffers with the new experiences; the distribution of the slow replay buffer changes more slowly due to its larger size (hence the name "slow"). The slow replay buffer contains off-policy samples from d^D, whereas the fast replay buffer contains (approximately) on-policy samples from d^π (assuming the buffer size is small enough). The slow replay buffer therefore has better coverage of the transition dynamics of the environment while being less on-policy. Denoting the fast and slow replay buffers as D_f and D_s respectively, we estimate the ratio d^π/d^D by minimizing the following objective over the network w_ψ parametrized by ψ (the outputs w_ψ(s, a) are forced to be non-negative via activation functions): L_w(ψ) := E_{D_s}[f⋆(f′(w_ψ(s, a)))] − E_{D_f}[f′(w_ψ(s, a))]. From Lemma 2, we can recover an estimate of the density ratio from the optimal w_ψ minimizing the L_w(ψ) objective. To address the finite sample size issue, we apply self-normalization (Cochran, 2007) to the importance weights over the slow replay buffer D_s with a temperature hyperparameter T: w̃_ψ(s, a) := w_ψ(s, a)^{1/T} / E_{D_s}[w_ψ(s, a)^{1/T}] (10). The final objective for TD learning over Q is then L_Q(θ; d^π) ≈ L_Q(θ; D_s, w̃_ψ) := E_{(s,a)∼D_s}[w̃_ψ(s, a)(Q_θ(s, a) − B̂^π Q_θ(s, a))²] (11), where the target B̂^π Q_θ is estimated via Monte Carlo samples. We keep the remainder of the algorithm, such as the policy gradient and value network updates (if present), unchanged, so this method can be adapted to different off-policy actor-critic algorithms, utilizing their respective advantages.
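The self-normalization step in Eq. 10 is straightforward to implement at the batch level, approximating the expectation over D_s by the sampled batch (a sketch with names of our own choosing):

```python
import numpy as np

def self_normalized_weights(w, T):
    """Temperature-smoothed, self-normalized importance weights (Eq. 10 style).

    w: raw density-ratio estimates for a batch sampled from the slow buffer.
    T: temperature; larger T flattens the weights toward uniform.
    The returned weights have batch mean 1, so the weighted TD loss stays on
    the same scale as the unweighted loss.
    """
    w_t = np.asarray(w, dtype=float) ** (1.0 / T)
    return w_t / w_t.mean()

w = [0.1, 0.5, 1.0, 2.0, 10.0]
print(self_normalized_weights(w, T=5))   # close to uniform
print(self_normalized_weights(w, T=1))   # follows the raw ratios
```

Each weight then multiplies the corresponding squared TD error in the batch estimate of Eq. 11.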
We observe that also using the weights to correct the policy updates does not provide much marginal improvement, so we did not consider this for comparison. We describe a general procedure of our approach in Algorithm 1 (Appendix A), where one can modify some "base" actor-critic algorithm to implement our approach. The algorithm covers both stochastic and deterministic policies, as our method does not require likelihood estimates from the policy. We choose the Jensen-Shannon divergence as our f-divergence, so w_ψ can be treated as a probabilistic classifier.

5. RELATED WORK

Experience replay (Lin, 1992) is a crucial component in deep reinforcement learning (Hessel et al., 2017; Andrychowicz et al., 2017; Schaul et al., 2015), where off-policy experiences are utilized to improve sample efficiency. These experiences can be used in policy updates (such as in actor-critic methods (Konda & Tsitsiklis, 2000; Wang et al., 2016)), in value updates (such as in deep Q-learning (Schaul et al., 2015)), or in evaluating TD update targets (Precup, 2000; Precup et al., 2001). For value updates, there are two sources of randomness that could benefit from importance weights (prioritization). The first source is the evaluation of the TD learning target for longer traces such as TD(λ); importance weights can be used to debias targets computed from off-policy trajectories (Precup, 2000; Munos et al., 2016; Espeholt et al., 2018; Schmitt et al., 2019), similar to their role in policy learning. The second source is the sampling of state-action pairs whose values are updated (Schaul et al., 2015), which is what we address in this paper. DisCor (Kumar et al., 2020) suggests avoiding on-policy experiences, which seems to contrast with what we have promoted. However, their analysis is based on the Bellman optimality operator, which aims to find the optimal Q-value function, while ours is based on the Bellman evaluation operator, which aims to find the Q-value function under the current policy; this could partially explain why DisCor did not achieve superior performance over the baseline approach on OpenAI gym tasks. Likelihood-free density ratio estimation has been adopted in imitation learning (Ho & Ermon, 2016), inverse reinforcement learning (Fu et al., 2017) and model-based off-policy policy evaluation (Grover et al., 2019). Different from these cases, we do not use the weights to estimate the advantage function or to reduce bias in reward estimation; our goal is to improve the performance of TD learning with function approximation.
Dual representations of f-divergences have also been leveraged in reinforcement learning (Nachum et al., 2019; Nachum & Dai, 2020), but there they are used in a regularizer that encourages exploration to stay close to off-policy experiences; the importance weights are added to the reward function when computing the Q-value function, but do not otherwise affect the replay experiences.

6. EXPERIMENTS

We combine the proposed prioritization approach with three popular actor-critic algorithms, namely Soft Actor-Critic (SAC, Haarnoja et al. (2018)), Twin Delayed Deep Deterministic policy gradient (TD3, Fujimoto et al. (2018)) and Data-regularized Q (DrQ, Kostrikov et al. (2020)). We compare our method with alternative approaches to prioritization; these include uniform sampling over the replay buffer (adopted by the original SAC and TD3 methods) and prioritized experience replay based on TD error (Schaul et al., 2015). We choose 5 continuous control tasks from OpenAI gym and 6 tasks from the DeepMind Control Suite (DCS, Tassa et al. (2018)). We consider state representations in all tasks and pixel representations from DCS. Our method introduces some additional hyperparameters compared to the vanilla approaches, namely the temperature T, the size of the fast replay buffer |D_f| and the architecture of the density ratio estimator w_ψ. To ensure fair comparisons against the baselines, we use the same hyperparameters as the original algorithms where available. For all environments we use the following default hyperparameters for likelihood-free importance weighting: T = 5, |D_f| = 10^4, |D_s| = 10^6. We use the f corresponding to the Jensen-Shannon divergence for better numerical stability. We include more experimental details in Appendix C.

6.1. EVALUATION

We use (+LFIW) to denote our likelihood-free importance weighting method, (+PER) to denote prioritization with TD error (Schaul et al., 2015),[2] and (+ERE) to denote Emphasizing Recent Experience (ERE, Wang & Ross (2019)) for SAC only. Table 1 shows the results on OpenAI gym (500k steps), whereas Tables 2 and 4 show the results on DCS with state (100k and 250k steps) and image representations (100k and 500k steps) respectively. These steps are chosen to demonstrate both initial training progress and approximate performance at convergence.

OpenAI Gym results

Table 1 demonstrates that, in terms of performance at 500k steps, our LFIW method outperforms the baseline methods on most tasks (except Hopper-v2 with TD3). On the other hand, PER and ERE do not perform very favorably against uniform sampling; a similar phenomenon has been observed by Novati & Koumoutsakos (2018) for PER on other actor-critic algorithms, such as DDPG (Lillicrap et al., 2015) and PPO (Schulman et al., 2017). We also considered combining PER with LFIW, but achieved little initial success. We believe this is because PER is designed for Q-learning, where learning the optimal (max) Q-function is the objective, rather than for actor-critic methods.

DCS results

Tables 2 and 4 show the results with SAC and TD3 on state representations and DrQ on pixel representations. Again, we observe improvements over the baselines in most cases, and comparable performance in the others. Notably, we achieve much higher performance with LFIW at 100k training steps, which demonstrates that biasing the replay buffer towards on-policy experiences achieves good policy performance more quickly.

6.2. ADDITIONAL ANALYSES

To illustrate the advantage of our method, we perform further analyses of the classification accuracy of w_ψ and the quality of the Q-function estimates on the Humanoid-v2 environment trained with SAC and SAC+LFIW. In Appendix C, we also include additional ablation studies over the hyperparameters introduced by LFIW, including the temperature T, the replay buffer size |D_f| and the number of hidden units in w_ψ. We observe that SAC+LFIW is insensitive to these hyperparameter changes, which suggests that the same LFIW hyperparameters can be applied readily to other tasks.

Accuracy of w_ψ. We use w_ψ to discriminate two types of experiences: experiences sampled from the policy trained with SAC for 5M steps are labeled positive, and a mixture of experiences sampled from policies trained for 1M to 4M steps are labeled negative. With the w_ψ predictions, we obtain a precision of 87.3% and an accuracy of 73.1%. This suggests that the importance weights tend to be higher for on-policy data as desired, and that the weights indeed bring the replay buffer closer to on-policy experiences.

Quality of Q-estimates. We compare the quality of the Q-estimates between SAC and SAC+LFIW, where we sample 20 trajectories from each policy and obtain the "ground truth" via Monte Carlo estimates of the true Q-value. We then evaluate the learned Q-function estimates and compare their correlations with the ground-truth values. For SAC, the Pearson and Spearman correlations are 0.41 and 0.11 respectively, whereas for SAC+LFIW they are 0.74 and 0.42 (higher is better). This shows that our Q-function estimates are much more reflective of the "true" values, which explains the improvements in sample complexity and in the performance of the learned policy.
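The correlation metrics used in this comparison can be computed directly; a small sketch with made-up numbers (the q values below are illustrative, not the paper's measurements):

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

def spearman(x, y):
    # Spearman = Pearson correlation of the ranks (no tie handling in this sketch)
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

q_mc = np.array([1.0, 2.0, 3.0, 4.0])    # Monte Carlo "ground truth" (made up)
q_hat = np.array([1.1, 2.2, 2.9, 4.3])   # learned Q-estimates (made up)
print(pearson(q_mc, q_hat), spearman(q_mc, q_hat))
```

Pearson measures linear agreement of the values themselves, while Spearman measures agreement of the rankings, which is what matters for ordering actions by estimated value.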

7. CONCLUSION

In this paper, we propose a principled approach to prioritized experience replay for TD learning of Q-value functions in actor-critic methods, where we re-weight the replay buffer to be closer to on-policy experiences. We justify this by showing theoretically that, when we measure the discrepancy between value functions with expected squared differences under state-action distributions, only the on-policy distribution ensures that the Bellman evaluation operator is a contraction for the resulting discrepancy. To achieve a good bias-variance trade-off in practice, we assign weights to the replay buffer based on estimated density ratios against the stationary distribution. These density ratios are estimated via samples from fast and slow replay buffers, which reflect on-policy and off-policy experiences respectively. Our method can be readily applied to deep reinforcement learning methods based on actor-critic approaches. Empirical results with SAC, TD3 and DrQ on 11 environments and 45 method-task combinations demonstrate that our method based on likelihood-free importance weighting achieves superior sample complexity in most cases compared to other methods, so our approach can be applied as a plug-and-play method to improve actor-critic methods.

A ALGORITHM

B PROOFS

Lemma 1. For all γ ∈ (0, 1), the Bellman operator B^π is a γ-contraction with respect to the ‖·‖_d norm if d = d^π holds almost everywhere, i.e., d = d^π a.e. ⟹ ‖B^π Q − B^π Q′‖_d ≤ γ ‖Q − Q′‖_d, ∀Q, Q′ ∈ Q.

Proof. From the definitions of ‖·‖_d and B^π, we have:

‖B^π Q − B^π Q′‖²_d (12)
= E_{(s,a)∼d}[(γ E_{s′,a′}[Q(s′, a′)] − γ E_{s′,a′}[Q′(s′, a′)])²]
= γ² E_{(s,a)∼d}[(E_{s′,a′}[Q(s′, a′) − Q′(s′, a′)])²]
≤ γ² E_{(s,a)∼d}[E_{s′,a′}[(Q(s′, a′) − Q′(s′, a′))²]] (13)
= γ² E_{(s,a)∼d}[(Q(s, a) − Q′(s, a))²] (14)
= γ² ‖Q − Q′‖²_d (15)

Theorem 1. For all γ ∈ (0, 1), the Bellman operator B^π is a γ-contraction with respect to the ‖·‖_d norm if and only if d = d^π holds almost everywhere, i.e., d = d^π a.e. ⟺ ‖B^π Q − B^π Q′‖_d ≤ γ ‖Q − Q′‖_d, ∀Q, Q′ ∈ Q.

Proof. The "if" case is apparent from the Lemma, so we only need to prove the "only if" case, for which we construct a counter-example whenever d ≠ d^π. Without loss of generality, assume Q′(s, a) = 0 for all (s, a) ∈ S × A. The following functionals of Q,

h(Q) := ‖B^π Q − B^π Q′‖²_d / γ² = E_{(s,a)∼d}[(E_{s′,a′}[Q(s′, a′)])²]
g(Q) := ‖Q − Q′‖²_d = E_{(s,a)∼d}[Q(s, a)²]

correspond to the quantities at the two ends of the contraction argument. Our goal is to find some Q ∈ Q such that h(Q) − g(Q) > 0, which would complete the contradiction. We can evaluate the functional derivatives of h(Q) and g(Q):

C.2 ABLATION STUDIES

To demonstrate the stability of our method across different hyperparameters, we conduct further analyses over the key hyperparameters, including the temperature T in Eq. 10, the size of the fast replay buffer |D_f|, and the number of hidden units in the classifier model w_ψ. We run the SAC+LFIW method on the Walker2d-v2 environment for 1000 episodes, using all the default hyperparameters unless explicitly changed.



[1] We also do not take the gradient over the target, which is the more conventional approach.
[2] We use α = 0.6, β = 0.4 in PER.



Estimation error E_{d^π}[(Q_θ − Q^π)²] for different prioritization methods, including uniform sampling (Uniform), sampling based on TD error (TD error), and sampling based on d^π (Policy).

Figure 1: Simulation of TD updates with different prioritization methods.

where s′ ∼ P(·|s, a), a′ ∼ π(·|s′), and d′(s′, a′) = Σ_{s,a} P(s′|s, a) π(a′|s′) d(s, a) represents the state-action distribution of the next step when the current distribution is d; the step from Eq. 13 to Eq. 14 rewrites the expectation over next steps as an expectation over (s′, a′) ∼ d′ and uses d′ = d. We use Jensen's inequality over the convex function x ↦ x² in Eq. 13. Since d^π is the stationary distribution, d = d′ ⟺ d = d^π a.e., so the "if" direction holds.

dh/dQ (s′, a′) = 2 Σ_{s,a} d(s, a) E(s, a) P(s′|s, a) π(a′|s′), dg/dQ (s, a) = 2 d(s, a) Q(s, a),

where E(s, a) = E_{s′∼P(·|s,a), a′∼π(·|s′)}[Q(s′, a′)] is the expected Q-function of the next step when the current step is at (s, a). Now consider some Q₀ such that, for some constant q > 0, Q₀(s, a) = q for all (s, a) ∈ S × A. Evaluating both functional derivatives at Q₀ (where E(s, a) = q):

dh/dQ |_{Q₀} (s′, a′) = 2 Σ_{s,a} d(s, a) q P(s′|s, a) π(a′|s′) = 2 d′(s′, a′) q, dg/dQ |_{Q₀} (s, a) = 2 d(s, a) q.

Because d and d′ are not equal almost everywhere (from the assumption that d is not the stationary distribution), there must exist some non-empty open set Γ ⊆ S × A where ∫_Γ (d′(s, a) − d(s, a)) ds da > 0. We then add a function ε : S × A → R such that ε(s, a) = ν · I((s, a) ∈ Γ), where I is the indicator function and ν is an infinitesimal amount. Evaluating (h − g) at Q₀ + ε, and noting that (h − g)(Q₀) = q² − q² = 0, we obtain

(h − g)(Q₀ + ε) = 2q ∫_Γ (d′(s, a) − d(s, a)) ν ds da + o(ν) > 0.

Therefore, the perturbed function Q₀ + ε is the contradiction we need.

C ADDITIONAL EXPERIMENTAL DETAILS

C.1 SETUP ON CHAIN MDP

The chain MDP considered has deterministic transitions, so we make the policy stochastic to ensure that a stationary distribution exists. In each epoch over all the state-action pairs, we use the following tabular TD learning update with fixed learning rate η to simulate the effect of weighting:

Q(s, a) ← Q(s, a) + (1 − (1 − η)^{w(s,a)}) (B^π Q(s, a) − Q(s, a)) (16)

where w(s, a) is the weight (uniform, TD error or d^π), which simulates w(s, a) TD updates with learning rate η. The weights are normalized to have a mean value of 1, which makes the number of updates per epoch the same across different methods.

Figure 2: Hyperparameter sensitivity analyses on Walker2d-v2 with SAC (temperature $T$, replay buffer size $|D_f|$, and hidden layer size of $w_\psi$).

Results on OpenAI Gym environments when trained with 500k steps. ERE is only designed for SAC, so its results on TD3 are not available.

Results of SAC and TD3 trained from states on the DeepMind Control environments with and without LFIW after 100k and 250k environment steps. The results show significant improvements when the agent is trained with LFIW. Results are reported over 5 random seeds. The maximum possible score for any environment is 1,000.

Results for DrQ (Kostrikov et al., 2020) on image-based RL on the DeepMind Control Suite. LFIW is applied to DrQ, a state-of-the-art image-based RL algorithm, and we observe consistent improvements across the DeepMind Control Suite benchmark.

Algorithm 1 Actor Critic with Likelihood-free Importance Weighted Experience Replay

Additional hyperparameters for DrQ (Kostrikov et al., 2020).

Temperature $T$. The temperature $T$ affects the variance of the assigned weights; a larger $T$ makes the weights more similar to each other, while a smaller $T$ relies more heavily on the outputs of the classifier. Since we are using finite replay buffers, a larger temperature reduces the chance of $w_\psi$ overfitting the data and negatively impacting performance. We consider $T = 1, 2.5, 5, 7.5, 10$ in Figure 2a; all cases have similar sample efficiency except for $T = 1$. We also perform a similar analysis on Humanoid-v2 with SAC in Figure 3. We observe a similar dependency on $T$ as in Walker, where the sample efficiency with $T = 1$ is significantly worse than for the other hyperparameters considered; this shows that overfitting the data can easily be avoided by using a higher temperature value even for higher-dimensional state-action distributions.

Replay buffer size $|D_f|$. The replay buffer size $|D_f|$ affects the amount of experience we treat as "on-policy". A larger $|D_f|$ reduces the risk of overfitting while increasing the chance of including more off-policy data. We consider $|D_f| = 1000, 10000, 50000, 100000$, corresponding to 1 to 100 episodes. We note that $|D_s| = 10^6$, so even for the largest $D_f$, $D_s$ is significantly larger. Performance is relatively stable, apart from a small drop for $|D_f| = 100000$.

Hidden units of $w_\psi$. The number of hidden units at each layer affects the expressiveness of the neural network. While networks with more hidden units are more expressive, they are also more prone to overfitting the replay buffers. We consider hidden layers with 128, 256, and 512 neurons respectively. While the smaller network with 128 units achieves superior performance initially, the other configurations catch up at around 1000 episodes.
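To make the role of $T$ concrete, the following is one plausible sketch, not the exact implementation, of turning classifier outputs into temperature-scaled, self-normalized importance weights. The functional form (odds ratio raised to $1/T$, normalized to mean 1) and the array of classifier probabilities are assumptions for illustration:

```python
import numpy as np

def lfiw_weights(p_fast, T):
    """Convert classifier probabilities into prioritization weights.

    p_fast: probability assigned by the classifier w_psi that (s, a) came from
            the small "on-policy" buffer D_f rather than the slow buffer D_s.
    T:      temperature; larger T flattens the weights toward uniform.
    """
    ratio = p_fast / (1.0 - p_fast)   # density-ratio estimate d_f / d_s
    w = ratio ** (1.0 / T)            # temperature scaling
    return w / w.mean()               # self-normalize to mean 1

p = np.array([0.2, 0.5, 0.8, 0.9])    # hypothetical classifier outputs
for T in (1.0, 5.0):
    print(T, np.round(lfiw_weights(p, T), 3))
# Larger T pushes all weights toward 1, i.e., closer to uniform sampling.
```

Under this sketch, $T \to \infty$ recovers uniform replay and $T = 1$ uses the raw density-ratio estimate, matching the sensitivity observed above: small $T$ leans fully on a classifier that may have overfit a finite buffer.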

