BEYOND PRIORITIZED REPLAY: SAMPLING STATES IN MODEL-BASED RL VIA SIMULATED PRIORITIES

Abstract

The prioritized Experience Replay (ER) method has attracted great attention; however, there is little theoretical understanding of such prioritization strategies and why they help. In this work, we revisit prioritized ER and, in an ideal setting, show its equivalence to minimizing a cubic loss, providing theoretical insight into why it improves upon uniform sampling. This theoretical equivalence highlights two limitations of current prioritized experience replay methods: insufficient coverage of the sample space and outdated priorities of training samples. This motivates our model-based approach, which does not suffer from these limitations. Our key idea is to actively search for high-priority states using gradient ascent. Under certain conditions, we prove that the hypothetical experiences generated from these states are sampled proportionally to approximately true priorities. We also characterize the distance between the sampling distribution of our method and the true prioritized sampling distribution. Our experiments on both benchmark and application-oriented domains show that our approach achieves superior performance over baselines.

1. INTRODUCTION

Using hypothetical experience simulated from an environment model can significantly improve the sample efficiency of RL agents (Ha & Schmidhuber, 2018; Holland et al., 2018; Pan et al., 2018; Janner et al., 2019; van Hasselt et al., 2019). Dyna (Sutton, 1991) is a classical MBRL architecture where the agent uses real experience to update its policy as well as its reward and dynamics models. In between taking actions, the agent can get hypothetical experience from the model to further improve the policy. An important question for effective Dyna-style planning is search-control: from what states should the agent simulate hypothetical transitions? On each planning step in Dyna, the agent has to select a state and action from which to query the model for the next state and reward. This question, in fact, already arises in what is arguably the simplest variant of Dyna: Experience Replay (ER) (Lin, 1992). In ER, visited transitions are stored in a buffer and at each time step, a mini-batch of experiences is sampled to update the value function. ER can be seen as an instance of Dyna, using a (limited) non-parametric model given by the buffer (see van Seijen & Sutton (2015) for a deeper discussion). Performance can be significantly improved by sampling proportionally to priorities based on errors, as in prioritized ER (Schaul et al., 2016; de Bruin et al., 2018), as well as by specialized sampling for the off-policy setting (Schlegel et al., 2019). Search-control strategies in Dyna similarly often rely on using priorities, though they can be more flexible in leveraging the model rather than being limited to only retrieving visited experiences. For example, a model enables the agent to sweep backwards by generating predecessors, as in prioritized sweeping (Moore & Atkeson, 1993; Sutton et al., 2008; Pan et al., 2018; Corneil et al., 2018).
Other methods have tried alternatives to error-based prioritization, such as searching for states with high reward (Goyal et al., 2019), high value (Pan et al., 2019) or states that are difficult to learn (Pan et al., 2020). Another strategy is to directly generate hypothetical experiences from trajectory optimization algorithms (Gu et al., 2016). These methods are all supported by nice intuition, but as yet lack solid theoretical reasons for why they can improve sample efficiency. In this work, we provide new insights about how to choose the sampling distribution over states from which we generate hypothetical experience. In particular, we theoretically motivate why error-based prioritization is effective, and provide a mechanism to generate states according to more accurate error estimates. We first prove that $\ell_2$ regression with error-based prioritized sampling is equivalent to minimizing a cubic objective with uniform sampling in an ideal setting. We then show that minimizing the cubic power objective has a faster convergence rate during the early learning stage, providing theoretical motivation for error-based prioritization. This theoretical understanding illuminates two issues of prioritized ER: insufficient sample space coverage and outdated priorities. To overcome these limitations, we propose a search-control strategy in Dyna that leverages a model to simulate errors and to find states with high expected error. Finally, we demonstrate the efficacy of our method on various benchmark domains and an autonomous driving application.

2. PROBLEM FORMULATION

We formalize the problem as a Markov Decision Process (MDP), a tuple $(S, A, P, R, \gamma)$ including state space $S$, action space $A$, probability transition kernel $P$, reward function $R$, and discount rate $\gamma \in [0, 1]$. At each environment time step $t$, an RL agent observes a state $s_t \in S$, and takes an action $a_t \in A$. The environment transitions to the next state $s_{t+1} \sim P(\cdot|s_t, a_t)$, and emits a scalar reward signal $r_{t+1} = R(s_t, a_t, s_{t+1})$. A policy is a mapping $\pi : S \times A \to [0, 1]$ that determines the probability of choosing an action at a given state. The agent's objective is to find an optimal policy. A popular algorithm is Q-learning (Watkins & Dayan, 1992), where parameterized action-values $Q_\theta$ are updated using $\theta = \theta + \alpha \delta_t \nabla_\theta Q_\theta(s_t, a_t)$ for step-size $\alpha > 0$ with TD-error $\delta_t := r_{t+1} + \gamma \max_{a' \in A} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t)$. The policy is defined by acting greedily w.r.t. these action-values. ER is critical when using neural networks to estimate $Q_\theta$, as used in DQN (Mnih et al., 2015), both to stabilize and speed up learning. MBRL has the potential to provide even further sample efficiency improvements. We build on the Dyna formalism (Sutton, 1991) for MBRL, and more specifically the recently proposed HC-Dyna (Pan et al., 2019) as shown in Algorithm 1. HC-Dyna provides a particular approach to search-control, the mechanism of generating states or state-action pairs from which to query the model to get next states and rewards (i.e. hypothetical experiences). It is characterized by the fact that it generates states by hill climbing on some criterion function $h(\cdot)$. The term Hill Climbing (HC) is used for generality as the vanilla gradient ascent procedure is modified to resolve certain challenges (Pan et al., 2019). Two particular choices have been proposed for $h(\cdot)$: the value function $v(s)$ from Pan et al. (2019) and the gradient magnitude $\|\nabla_s v(s)\|$ from Pan et al. (2020).
The former is used as a measure of the utility of visiting a state and the latter is considered a measure of value approximation difficulty. The hypothetical experience is obtained by first selecting a state $s$, then typically selecting the action $a$ according to the current policy, and then querying the model to get the next state $s'$ and reward $r$. These hypothetical transitions are treated just like real transitions. For this reason, HC-Dyna combines both real experience and hypothetical experience into mini-batch updates. These $n$ updates, performed before taking the next action, are called planning updates, as they improve the action-value estimates (and so the policy) using a model. However, it should be noted that there are several limitations to the two previous works. First, the HC method proposed by Pan et al. (2019) is mostly supported by intuitions, without any theoretical justification for using the stochastic gradient ascent trajectories for search-control. Second, the HC on the gradient norm and Hessian norm of the learned value function (Pan et al., 2020) is supported by some suggestive theoretical evidence, but it suffers from high computation cost and zero gradients due to the high-order differentiation (i.e., $\nabla_s \|\nabla_s v(s)\|$), as noted by the authors. This paper introduces our novel HC search-control method, motivated by overcoming the limitations of the prioritized ER method. It has stronger theoretical support than the work by Pan et al. (2019) and lower computational cost than the existing work by Pan et al. (2020).
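The Q-learning update and TD-error defined in this section can be sketched as follows. This is a minimal illustration that assumes linear action-values $Q_\theta(s, a) = \theta^\top \phi(s, a)$; the feature vectors and the function names `td_error` and `q_learning_step` are hypothetical and not part of the paper, which uses neural networks.

```python
import numpy as np

def td_error(theta, phi_sa, reward, phi_next_all, gamma):
    """delta_t = r_{t+1} + gamma * max_a' Q(s_{t+1}, a') - Q(s_t, a_t), linear Q assumed."""
    q_sa = theta @ phi_sa                  # Q_theta(s_t, a_t)
    q_next = np.max(phi_next_all @ theta)  # max over actions at s_{t+1}
    return reward + gamma * q_next - q_sa

def q_learning_step(theta, phi_sa, reward, phi_next_all, gamma=0.9, alpha=0.1):
    """One update theta <- theta + alpha * delta_t * grad_theta Q_theta(s_t, a_t)."""
    delta = td_error(theta, phi_sa, reward, phi_next_all, gamma)
    # For a linear Q, the gradient w.r.t. theta is the feature vector itself.
    return theta + alpha * delta * phi_sa, delta

# Toy transition with made-up features: one row of phi_next_all per next action.
theta0 = np.zeros(3)
phi_sa = np.array([1.0, 0.0, 0.0])
phi_next_all = np.array([[0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
theta1, delta = q_learning_step(theta0, phi_sa, reward=1.0, phi_next_all=phi_next_all)
```

The magnitude of `delta` is the quantity that error-based prioritization uses as the priority of a transition.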

3. A DEEPER LOOK AT ERROR-BASED PRIORITIZED SAMPLING

In this section, we provide theoretical motivation for error-based prioritized sampling. We show that prioritized sampling can be reformulated as optimizing a cubic power objective with uniform sampling. We prove that optimizing the cubic objective provides a faster convergence rate during early learning. Based on these results, we highlight that prioritized ER has two limitations 1) outdated priorities and 2) insufficient coverage of the sample space. This motivates our method in the next section to address the two limitations.

3.1. PRIORITIZED SAMPLING AS A CUBIC OBJECTIVE

In $\ell_2$ regression, we minimize the mean squared error $\min_\theta \frac{1}{2n}\sum_{i=1}^{n}(f_\theta(x_i) - y_i)^2$, for training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{n}$ and function approximator $f_\theta$, such as a neural network. In error-based prioritized sampling, we define the priority of a sample $(x, y) \in \mathcal{T}$ as $|f_\theta(x) - y|$; the probability of drawing a sample $(x, y) \in \mathcal{T}$ is typically $q(x, y; \theta) \propto |f_\theta(x) - y|$. We employ the following form to compute the probabilities:

$$q(x, y; \theta) := \frac{|f_\theta(x) - y|}{\sum_{i=1}^{n}|f_\theta(x_i) - y_i|} \qquad (1)$$

We can show an equivalence between the gradients of the squared objective with this prioritization and the cubic power objective $\frac{1}{3n}\sum_{i=1}^{n}|f_\theta(x_i) - y_i|^3$. See Appendix A.3 for the proof.

Theorem 1. For a constant $c$ determined by $\theta, \mathcal{T}$, we have

$$\mathbb{E}_{(x,y)\sim \mathrm{uniform}(\mathcal{T})}\left[\nabla_\theta (1/3)|f_\theta(x) - y|^3\right] = c\,\mathbb{E}_{(x,y)\sim q(x,y;\theta)}\left[\nabla_\theta (1/2)(f_\theta(x) - y)^2\right]$$

This simple theorem provides an intuitive reason for why prioritized sampling can help improve sample efficiency: the gradient of the cubic function is steeper than that of the square function when the error is relatively large (Figure 1). Theorem 2 further characterizes the difference between the convergence rates by optimizing the mean square error and the cubic power objective, providing a solid motivation for using error-based prioritized sampling.

Theorem 2 (Fast early learning). Consider the following two objectives: $\ell_2(x, y) := \frac{1}{2}(x - y)^2$ and $\ell_3(x, y) := \frac{1}{3}|x - y|^3$. Denote $\delta_t := |x_t - y|$ and $\tilde{\delta}_t := |\tilde{x}_t - y|$. Define the functional gradient flow updates on these two objectives:

$$\frac{dx_t}{dt} = -\eta\,\frac{d\{\frac{1}{2}(x_t - y)^2\}}{dx_t}, \qquad \frac{d\tilde{x}_t}{dt} = -\eta\,\frac{d\{\frac{1}{3}|\tilde{x}_t - y|^3\}}{d\tilde{x}_t}.$$

Given error threshold $\epsilon \geq 0$, define the hitting times $t_\epsilon := \min_t\{t : \delta_t \leq \epsilon\}$ and $\tilde{t}_\epsilon := \min_t\{t : \tilde{\delta}_t \leq \epsilon\}$. For any initial function value $x_0$ s.t. $\delta_0 > 1$, $\exists \epsilon_0 \in (0, 1)$ such that $\forall \epsilon > \epsilon_0$, $t_\epsilon \geq \tilde{t}_\epsilon$ (see footnote 1).

Proof. Please see Appendix A.4. Given the same $\epsilon$ and the same initial value of $x$, first we derive $t_\epsilon = \frac{1}{\eta}\ln\frac{\delta_0}{\epsilon}$ and $\tilde{t}_\epsilon = \frac{1}{\eta}\left(\frac{1}{\epsilon} - \frac{1}{\delta_0}\right)$. Then we analyze the condition on $\epsilon$ to see when $t_\epsilon \geq \tilde{t}_\epsilon$, i.e. when minimizing the square error is slower than minimizing the cubic error.
The above theorem says that when the initial error is relatively large, it is faster to get to a certain low error point with the cubic objective. We can test this in simulation, with the following minimization problems: $\min_{x \geq 0} x^2$ and $\min_{x \geq 0} x^3$. We use the hitting time formulae $t_\epsilon = \frac{1}{\eta}\ln\frac{\delta_0}{\epsilon}$ and $\tilde{t}_\epsilon = \frac{1}{\eta}\left(\frac{1}{\epsilon} - \frac{1}{\delta_0}\right)$ derived in the proof to compute the hitting time ratio $t_\epsilon / \tilde{t}_\epsilon$ under different initial values $x_0$ and final error values $\epsilon$. In Figure 1(c)(d), we can see that it usually takes a significantly shorter time for the cubic loss to reach a certain $x_t$ with various $x_0$ values.

Footnote 1: Finding the exact value of $\epsilon_0$ would require a definition of ordering on the complex plane, which leads to $\epsilon_0 = -\frac{1}{W(\log 1/a - 1/a - \pi i)}$, where $W(\cdot)$ is the Wright Omega function; then we have $\tilde{t}_\epsilon \leq t_\epsilon$. Our theorem statement is sufficient for the purpose of characterizing the convergence rate.

Implications of the above theory. The equivalence from Theorem 1 inspires us to identify two limitations of the current prioritized ER method: 1) The equivalence requires the priorities of all samples to be updated after the training parameters are updated at each time step. 2) The equivalence requires the prioritized sampling distribution to be calculated on the whole training set; in an online RL setting, at the current time step $t$, we only have visited samples. These visited samples provide a biased training set w.r.t. the current policy, which likely does not reasonably cover the state space. We will present our approach to overcome these limitations in Section 4. In the next section, we will empirically verify our theoretical findings.
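The equivalence in Theorem 1 can be checked numerically on a small example. The sketch below assumes a linear model $f_\theta(x) = \theta^\top x$ and random data, both chosen only for illustration; it compares the uniformly averaged gradient of the cubic loss with the priority-weighted gradient of the squared loss, using the scaling constant $c = \frac{1}{n}\sum_i |f_\theta(x_i) - y_i|$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))      # illustrative inputs
y = rng.normal(size=n)           # illustrative targets
theta = rng.normal(size=d)       # parameters of the assumed linear model

err = X @ theta - y              # f_theta(x_i) - y_i
abs_err = np.abs(err)

# Uniform sampling, cubic loss: grad of (1/3)|e|^3 is |e| * e * grad f, with grad f = x.
grad_cubic_uniform = (abs_err * err) @ X / n

# Prioritized sampling, squared loss: grad of (1/2)e^2 is e * grad f,
# weighted by q_i = |e_i| / sum_j |e_j| (Equation 1).
q = abs_err / abs_err.sum()
grad_sq_prioritized = (q * err) @ X

# Scaling constant relating the two expected gradients.
c = abs_err.mean()
gap = np.max(np.abs(grad_cubic_uniform - c * grad_sq_prioritized))
```

Here `gap` should be at the level of floating-point round-off. The identity only holds when `q` is recomputed from the current `theta` over the whole training set, which is the ideal setting behind the two limitations discussed above.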

3.2. EMPIRICAL DEMONSTRATION

In this section, we empirically show: 1) the practical performance of the cubic objective; 2) the importance of having sufficient sample space coverage and of updating the priorities of all the training samples; and 3) the reasons why we should not directly use a high power objective in general. We refer readers to Appendix A.6 for missing details and to Appendix A.7 for additional experiments. We conduct experiments on a supervised learning task. We generate a training set $\mathcal{T}$, $|\mathcal{T}| = 4000$, by uniformly sampling $x \in [-2, 2]$ and adding zero-mean Gaussian noise with standard deviation $\sigma$ to the target $f_{\sin}(x)$ values, where $f_{\sin}(x) = \sin(8\pi x)$ if $x \in [-2, 0)$ and $f_{\sin}(x) = \sin(\pi x)$ if $x \in [0, 2]$. The testing set contains 1k samples and the targets are not noise-contaminated. Pan et al. (2020) show that the high frequency region $[-2, 0]$ is the main source of prediction error. Hence we expect prioritized sampling to make a clear difference in terms of sample efficiency on this dataset. We compare the following algorithms.

L2: $\ell_2$ regression with uniform sampling from $\mathcal{T}$.

Full-PrioritizedL2: $\ell_2$ regression with prioritized sampling according to the distribution defined in Equation 1; the priorities of all samples in the training set are updated after each mini-batch update.

PrioritizedL2: the only difference from Full-PrioritizedL2 is that only the priorities of those training examples sampled in the mini-batch are updated at each iteration; the rest of the training samples keep their original priorities. This resembles the approach taken by vanilla Prioritized ER in the RL setting (Schaul et al., 2016).

Cubic: minimizing the cubic objective with uniform sampling.

Power4: $\min_\theta \frac{1}{n}\sum_{i=1}^{n}(f_\theta(x_i) - y_i)^4$ with uniform sampling. We include it to show that there is almost no gain, and potential harm, from using higher powers.

We use 32 × 32 tanh layers for all algorithms and optimize the learning rate over the range {0.01, 0.001, 0.0001}.
Figure 2(a)-(d) shows the learning curves in terms of testing error of all the above algorithms with various settings. We identify five important observations: 1) with a small mini-batch size of 128, there is a significant difference between Full-PrioritizedL2 and Cubic; 2) with increased mini-batch size, although all algorithms perform better, Cubic achieves the largest improvement and its behavior tends to approximate the prioritized sampling algorithm; 3) as shown in Figure 2(a), prioritized sampling does not show an advantage when the training set is small; 4) PrioritizedL2, which does not update all priorities, can be significantly worse than vanilla $\ell_2$ regression (uniform sampling); 5) when increasing the noise standard deviation $\sigma$ from 0.1 to 0.5, all algorithms perform worse and the objectives with higher power are hurt more.

The importance of sample space coverage. Observations 1) and 2) show that a high power objective has to use a much larger mini-batch size to achieve performance comparable to $\ell_2$ with prioritized sampling. A possible reason is that prioritized sampling allows us to immediately get many samples from the high-error regions. Uniform sampling, on the other hand, gets fewer of those samples with a limited mini-batch size. This motivates us to test prioritized sampling with a small training set where both algorithms get fewer samples. Figure 2(a) together with (b) indicates that prioritized sampling needs sufficient samples across the sample space to maintain its advantage. This requirement is intuitive, but it illuminates a serious limitation of prioritized ER in RL: only visited real experiences from the ER buffer can get sampled. If the state space is large, the ER buffer likely contains only a small subset of the state space, which amounts to a very small training set.

Thorough priority updating. Observation 4) highlights the importance of using an up-to-date sampling distribution at each time step.
Outdated priorities change the sampling distribution in an unpredictable manner and the learning performance can degrade. We further verify this phenomenon on the classical Mountain Car domain (Sutton & Barto, 2018; Brockman et al., 2016). Figure 2(e) shows the evaluation learning curves of different variants of Deep Q networks (DQN) corresponding to the supervised learning algorithms. We use a small 16 × 16 ReLU NN as the Q-function. We expect that a small NN should highlight the issue of priority updating: every mini-batch update potentially perturbs the values of many other states. Hence it is likely that many experiences in the ER buffer have the wrong priorities without thorough priority updating. We do indeed find this to be the case, with Full-PrioritizedER performing significantly better.

Regarding high power objectives. As discussed above, observations 1) and 2) tell us that a high power objective likely requires a large mini-batch size. Ideally, it would use a batch algorithm, i.e. the whole training set, for the improved convergence rate to manifest. This requirement makes the algorithm hard to scale to larger training datasets. Observation 5) indicates another reason why a high power objective should not be preferred: it amplifies the effect of noise added to the target variables. In Figure 2(d), the Power4 objective suffers most from the increased target noise.
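The data-generating process and the two priority-refresh schemes compared in this section can be sketched as below. This is only an illustration of the sampling mechanics under stated assumptions: the helper names are hypothetical, the `predict` callable stands in for the trained network (a constant-zero predictor is used as a placeholder), and no training loop is included.

```python
import numpy as np

def f_sin(x):
    """Piecewise target: sin(8*pi*x) on [-2, 0), sin(pi*x) on [0, 2]."""
    return np.where(x < 0.0, np.sin(8.0 * np.pi * x), np.sin(np.pi * x))

def make_training_set(n=4000, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, size=n)
    y = f_sin(x) + rng.normal(0.0, sigma, size=n)  # noisy targets
    return x, y

def sample_minibatch(priorities, batch_size, rng):
    """Draw indices with probability proportional to priority (Equation 1)."""
    q = priorities / priorities.sum()
    return rng.choice(len(priorities), size=batch_size, p=q)

def refresh_priorities(priorities, predict, x, y, idx=None):
    """idx=None refreshes every sample (Full-PrioritizedL2 style);
    otherwise only the sampled indices are refreshed (PrioritizedL2 style)."""
    if idx is None:
        return np.abs(predict(x) - y)
    updated = priorities.copy()
    updated[idx] = np.abs(predict(x[idx]) - y[idx])
    return updated

x, y = make_training_set()
rng = np.random.default_rng(1)
priorities = np.ones_like(x)               # assumed initial priorities
predict = lambda inputs: np.zeros_like(inputs)  # placeholder for f_theta
batch_idx = sample_minibatch(priorities, 128, rng)
partial = refresh_priorities(priorities, predict, x, y, idx=batch_idx)
full = refresh_priorities(priorities, predict, x, y)
```

After the partial refresh, every sample outside `batch_idx` still carries its old priority, which is how the sampling distribution drifts away from the one required by Theorem 1.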

4. ADDRESSING THE LIMITATIONS OF PRIORITIZED REPLAY: ACQUIRING SAMPLES VIA SIMULATED PRIORITIES ON CONTINUOUS DOMAINS

In this section, we propose a method to mitigate the limitations of the conventional prioritized ER method mentioned in the previous section. We start with the following theorem. We denote by $P^\pi(s', r|s)$ the transition probability given a policy $\pi$.

Theorem 3 (Sampling method). Given the state $s \in S$, let $v^\pi(\cdot; \theta) : S \to \mathbb{R}$ be a differentiable value function under policy $\pi$ parameterized by $\theta$. Define $y(s) := \mathbb{E}_{r, s' \sim P^\pi(s', r|s)}[r + \gamma v^\pi(s'; \theta)]$, and denote the TD error as $\delta(s, y; \theta_t) := y(s) - v(s; \theta_t)$. Given some initial state $s_0 \in S$, define the state sequence $\{s_i\}$ as the one generated by the state updating rule

$$s_{i+1} \leftarrow s_i + \alpha_a \nabla_s \log|\delta(s_i, y(s_i); \theta_t)| + X_i,$$

where $\alpha_a$ is a sufficiently small stepsize and $X_i$ is a Gaussian random variable with some constant variance. Then the sequence $\{s_i\}$ converges to the distribution $p(s) \propto |\delta(s, y(s))|$.

The proof is a direct consequence of the convergent behavior of the Langevin dynamics stochastic differential equation (SDE) (Roberts, 1996; Welling & Teh, 2011; Zhang et al., 2017). We include a brief discussion and background knowledge in Appendix A.2. In practice, we can compute the state value estimate by $v(s) = \max_a Q(s, a; \theta_t)$ as suggested by Pan et al. (2019). In the case that a true environment model is not available, we have to compute an estimate $\hat{y}(s)$ of $y(s)$ by a learned model. Then at each time step $t$, states approximately following the distribution $p(s) \propto |\delta(s, y(s))|$ can be generated by

$$s \leftarrow s + \alpha_a \nabla_s \log|\hat{y}(s) - \max_a Q(s, a; \theta_t)| + X, \qquad (3)$$

where $X$ is a Gaussian random variable with zero mean and some small variance. In the implementation, observing that $\alpha_a$ is small, we treat $\hat{y}(s)$ as a constant given a state $s$, without backpropagating through it. We provide an upper bound in the theorem below for the difference between the sampling distributions acquired by the true model and the learned model.
We denote the transition probability distribution under policy $\pi$ and the true model as $P^\pi(r, s'|s)$, and under the learned model as $\hat{P}^\pi(r, s'|s)$. Let $p(s)$ and $\hat{p}(s)$ be the convergent distributions described in Theorem 3 by using the true and learned models respectively. Let $d_{tv}(\cdot, \cdot)$ be the total variation distance between two probability distributions. Define $u(s) := |\delta(s, y(s))|$, $\hat{u}(s) := |\delta(s, \hat{y}(s))|$, $Z := \int_{s \in S} u(s)\,ds$, $\hat{Z} := \int_{s \in S} \hat{u}(s)\,ds$. Then we have the following bound. Please see Appendix A.5 for the proof and further interpretations.

Theorem 4. Assume: 1) the reward magnitude is bounded, $|r| \leq R_{\max}$, and define $V_{\max} := \frac{R_{\max}}{1 - \gamma}$; 2) the largest model error for a single state is some small value, $\epsilon_s := \max_s d_{tv}(P^\pi(\cdot|s), \hat{P}^\pi(\cdot|s))$, and the total model error is bounded, i.e. $\epsilon := \int_{s \in S} \epsilon_s\,ds < \infty$. Then $\forall s \in S$,

$$|p(s) - \hat{p}(s)| \leq \min\left(\frac{V_{\max}(p(s)\epsilon + \epsilon_s)}{\hat{Z}}, \frac{V_{\max}(\hat{p}(s)\epsilon + \epsilon_s)}{Z}\right).$$

Algorithmic details. We present our algorithm, called Dyna-TD (Temporal Difference error), in Algorithm 3 in Appendix A.6. Our algorithm follows Algorithm 1; in particular, we choose the function $h(s) := \log|\hat{y}(s) - \max_a Q(s, a; \theta_t)|$, i.e. we run the updating rule in Equation 3 to generate states.

Empirical verification of sampling distribution. We validate the efficacy of our sampling method by empirically examining the distance between the sampling distribution acquired by our gradient ascent rule in Equation 3 (denoted as $p_1(\cdot)$) and the desired distribution computed by thorough priority updating of all states under the current parameters (denoted as $p^*(\cdot)$) on the GridWorld domain (Pan et al., 2019) (Figure 3(a)), where the probability density can be conveniently approximated by discretization. We record how this distance changes as we train our Algorithm 3. The distance between the sampling distribution of Prioritized ER (denoted as $p_2(\cdot)$) and $p^*(\cdot)$ is also included for comparison. All these distributions are computed by normalizing visitation counts on the discretized 50×50 GridWorld.
We compute the distances of $p_1$, $p_2$ to $p^*$ by two sensible weighting schemes: 1) on-policy weighting: $\sum_{j=1}^{2500} d^\pi(s_j)|p_i(s_j) - p^*(s_j)|$, $i \in \{1, 2\}$, where $d^\pi$ is approximated by uniformly sampling 3k states from a recency buffer; 2) uniform weighting: $\frac{1}{2500}\sum_{j=1}^{2500}|p_i(s_j) - p^*(s_j)|$, $i \in \{1, 2\}$. All details are in Appendix A.6. Figure 3(b)(c) shows that our algorithm Dyna-TD, either with a true or an online learned model, maintains a significantly closer distance to the desired sampling distribution $p^*$ than PrioritizedER under both weighting schemes. Furthermore, despite the mismatch between the implementation and our Theorem 3 above (namely that Dyna-TD may not run enough gradient steps to reach the stationary distribution), the induced sampling distribution is quite close to the one obtained by running many more gradient steps (Dyna-TD-Long), which we expect to reach stationary behavior. This indicates that we can reduce the time cost by lowering the number of gradient steps while keeping the sampling distribution similar. In Figure 3(d), we further verify that given the same time budget, our algorithm achieves better performance, despite the fact that DQN and PrioritizedER are able to process many more samples. This makes the additional time spent on search-control worth it.

Sample space coverage. To further illustrate that our method indeed enables broader coverage than the model-free ER method, we visualize the DQN's ER state distributions trained with and without prioritization respectively, and our algorithm's search-control queue state distribution. Figure 4 shows that there is a significant difference between the ER distributions and our queue's distribution. Specifically, our search-control queue distribution looks more uniformly distributed across the whole state space, with slightly higher density in the middle. This concentration corresponds to Figure 5 by Pan et al. (2020), as the agent must learn to pass the small hole (i.e., a bottleneck region) to get to the goal area. The significantly broader coverage of our search-control queue distribution possibly explains the superior performance of our algorithm in Figure 5(c)(d).
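The two weighted distances used in the empirical verification above can be computed as in the following sketch. The helper names and the tiny four-cell example are hypothetical; in the paper the distributions come from visitation counts on the 50×50 discretized GridWorld.

```python
import numpy as np

def counts_to_distribution(counts):
    """Normalize visitation counts into a probability vector over grid cells."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def weighted_l1_distance(p, p_star, weights):
    """sum_j w_j * |p(s_j) - p*(s_j)| for a given weighting over cells."""
    p, p_star, weights = (np.asarray(a, dtype=float) for a in (p, p_star, weights))
    return float(np.sum(weights * np.abs(p - p_star)))

# Illustrative counts on a 4-cell grid (made-up numbers).
p_star = counts_to_distribution([40, 10, 25, 25])  # desired prioritized distribution
p_1 = counts_to_distribution([25, 25, 25, 25])     # distribution being evaluated
d_pi = counts_to_distribution([7, 1, 1, 1])        # on-policy state weighting
uniform_w = np.full(4, 1.0 / 4.0)                  # uniform weighting 1/N

on_policy_dist = weighted_l1_distance(p_1, p_star, d_pi)
uniform_dist = weighted_l1_distance(p_1, p_star, uniform_w)
```

The on-policy weighting emphasizes mismatch in states the agent actually visits, while the uniform weighting treats every cell equally.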

5. EXPERIMENTS

In this section, we empirically show that our algorithm achieves stable and consistent performance across different settings. We first show the overall comparative performance on various benchmark domains. We then show that our algorithm Dyna-TD is more robust to environment noise than PrioritizedER. Last, we demonstrate the practical utility of our algorithm on an autonomous driving vehicle application. Note that Dyna-TD uses the same hill climbing parameter settings across all benchmark domains. We refer readers to Appendix A.6 for any missing details.

Baselines. We include the following baseline competitors.

ER is DQN with a regular ER buffer without prioritized sampling.

PrioritizedER uses a priority queue to store visited experiences, and each experience is sampled proportionally to its TD error magnitude. Note that, as per the original paper (Schaul et al., 2016), after each mini-batch update, only the priorities of those samples in the mini-batch are updated.

Dyna-Value (Pan et al., 2019) is the Dyna variant which performs hill climbing on the value function to acquire states to populate the search-control queue.

Dyna-Frequency (Pan et al., 2020) is the Dyna variant which performs hill climbing on the norm of the gradient of the value function to acquire states to populate the search-control queue.

Overall Performance. Figure 5 shows the overall performance of different algorithms on Acrobot, CartPole, GridWorld (Figure 3(a)) and MazeGridWorld (Figure 5(g)). Our key observations are: 1) Dyna-Value or Dyna-Frequency may converge to a sub-optimal policy when using a large number of planning steps; 2) Dyna-Frequency has clearly inconsistent performance across different domains; 3) our algorithm performs the best in most cases: even with an online learned model, our algorithm outperforms others on most of the tasks.

[Figure 5 caption, partially recovered: ... the learning curves are in (h).]
On MazeGW, we do not show model-free baselines as it is reported that model-free baselines do significantly worse than Dyna variants (Pan et al., 2020) . We do reproduce the result of Dyna-Frequency from that paper. Our interpretations of those observations are as follows. First, we can think about the case where some states have high value but low TD error. Dyna-Value may still frequently generate those states; this can waste samples and even incur sampling distribution bias, which can lead to a sub-optimal policy. This sub-optimality can be clearly observed on Acrobot, GridWorld and MazeGridWorld. Similar reasoning applies to Dyna-Frequency. Second, for Dyna-Frequency, as indicated by the original paper Pan et al. (2020) , the gradient or Hessian norm have very different numerical scales and highly depend on the choice of the function approximator or domain. This indicates that the algorithm requires finely tuned parameter settings, as the testing domain is varied, which possibly explains its inconsistent performances across domains. Third, notice that even though each algorithm runs the same number of planning steps, the model-based algorithms perform significantly better. This provides evidence for the benefits of leveraging the generalization power of the learned value function and a model. In contrast, model-free methods can only utilize visited states. Robustness to Noise. As a corresponding experiment to the supervised learning setting in Section 3, we show that our algorithm is more robust to increased noise variance than PrioritizedER. Figure 6 shows the evaluation learning curves on Mountain Car with planning steps 10, 30 and reward noise standard deviation σ ∈ {0, 0.1}. We would like to identify three key observations. First, our algorithm's relative performance to PrioritizedER resembles the Full-PrioritizedL2 to PrioritizedL2 from the supervised learning setting, as Full-PrioritizedL2 is more robust to target noise than PrioritizedL2. 
Second, our algorithm achieves almost the same performance as Dyna-Frequency, which is claimed to be robust to noise by Pan et al. (2020). Last, as observed on other environments, usually all algorithms can benefit from an increased number of planning steps; however, PrioritizedER and ER clearly degrade when using more planning steps with noise present.

Practical Utility in Autonomous Driving Application. We study the practical utility of our method in an autonomous driving application (Leurent, 2018) with an online learned model. As shown in Figure 7(a), we test on the roundabout-v0 domain, where the agent (i.e. the green car) should learn to go through a roundabout without collisions while maintaining as high a speed as possible. We would like to emphasize that there is a significantly lower number of car crashes with the policy learned by our algorithm on both domains, as we show in Figure 7(b). This coincides with our intuition. A crash should incur high temporal difference error, and our method actively searches for such states by gradient ascent. This ensures the agent gets sufficient training in these states during the planning stage, so that it learns to avoid them.

6. DISCUSSION

In this work, we provide theoretical justification for why prioritized ER can help improve sample efficiency. We identify crucial factors for it to be effective: sample space coverage and thorough priority updating. We then propose to sample states by Langevin dynamics and conduct experiments to show our method's efficacy. There are several interesting directions for future work. One is to study the effects of model error on sample efficiency with our search-control strategy. Another is to apply our method with a feature-to-feature model, which can improve our method's scalability. On the theory side, our cubic objective explains the original TD-error based prioritized ER (Schaul et al., 2016). However, there are other choices beyond TD-error based prioritization, such as distribution location or reward-based prioritization (Lambert et al., 2020). Whether these alternative prioritizations can also be formulated as surrogate objectives is an interesting future direction. In a concurrent work, Fujimoto et al. (2020) established an equivalence between loss functions and sampling distributions, which bears similarities to our Theorem 1. However, it is not clear whether optimization benefits similar to those shown in our Theorem 2 are enjoyed by more general loss functions and sampling distributions, which requires further investigation.

A APPENDIX

In Section A.1, we introduce some background on the Dyna architecture. We briefly discuss Langevin dynamics and its computation cost in our case in Section A.2. We then provide the proof of Theorem 1 in Section A.3 and the full proof of Theorem 2 in Section A.4. We present the proof for Theorem 4 in Section A.5. Details for reproducible research are in Section A.6. We provide supplementary experimental results in Section A.7.

A.1 BACKGROUND IN DYNA

Dyna integrates model-free and model-based policy updates in an online RL setting (Sutton, 1990). As shown in Algorithm 2, at each time step, a Dyna agent uses the real experience to learn a model and performs a model-free policy update. During the planning stage, simulated experiences are acquired from the model to further improve the policy. It should be noted that the concept of planning refers to any computational process which leverages a model to improve the policy, according to Sutton & Barto (2018). The mechanism of generating states or state-action pairs from which to query the model is called search-control, which is of critical importance to sample efficiency. Many existing works (Moore & Atkeson, 1993; Sutton et al., 2008; Gu et al., 2016; Pan et al., 2018; Corneil et al., 2018; Goyal et al., 2019; Janner et al., 2019; Pan et al., 2019) report different levels of sample efficiency improvement from different ways of generating hypothetical experiences during the planning stage.

A.2 LANGEVIN DYNAMICS

Consider the stochastic differential equation (SDE)

$$dW(t) = \nabla U(W_t)\,dt + \sqrt{2}\,dB_t,$$

where $B_t \in \mathbb{R}^d$ is a $d$-dimensional Brownian motion and $U$ is a continuously differentiable function. It turns out that the Langevin diffusion $(W_t)_{t \geq 0}$ converges to a unique invariant distribution $p(x) \propto \exp(U(x))$ (Chiang et al., 1987). By applying the Euler-Maruyama discretization scheme to the SDE, we acquire the discretized version

$$Y_{k+1} = Y_k + \alpha_{k+1}\nabla U(Y_k) + \sqrt{2\alpha_{k+1}}\,Z_{k+1},$$

where $(Z_k)_{k \geq 1}$ is an i.i.d. sequence of standard $d$-dimensional Gaussian random vectors and $(\alpha_k)_{k \geq 1}$ is a sequence of step sizes.
It has been proved that the limiting distribution of the sequence $(Y_k)_{k \ge 1}$ converges to the invariant distribution of the underlying SDE (Roberts, 1996; Durmus & Moulines, 2017). As a result, taking $U(\cdot)$ to be $\delta(\cdot)$ and $Y$ to be $s$ completes the proof for Theorem 3.

Computational time cost. It should be noted that the Langevin dynamics Monte Carlo method we use for generating states does introduce additional computational cost. However, in the main body of the paper, we already show that this time cost is worthwhile (Figure 3(d)). In theory, the computational time is reasonably small. Although each gradient ascent step takes one backpropagation, this backpropagation is w.r.t. a single state, not a mini-batch. Let the mini-batch size for updating the DQN be b, and the number of gradient steps be k. If we assume one mini-batch update takes O(c), then the time cost of Dyna-TD is O(kc/b). We would like to highlight that, although our approach incurs a higher computational cost, it is able to use fewer samples (i.e. fewer physical interactions with the real environment) to achieve better performance. The cost of physical interactions could be higher in practical situations. When really needed, there are intuitive engineering tricks to reduce the computational cost. For example, we can first learn a low-dimensional embedding and build a model in that space; the gradient ascent can then be done w.r.t. the feature instead of the raw input/observation itself. To further save time, we can also sacrifice a bit of accuracy and do the gradient ascent only every certain number of time steps to lower the amortized cost. Reducing time cost is not our current focus.
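As a minimal numerical sketch of the discretized update discussed above, the following Python snippet runs the Euler-Maruyama recursion with a constant step size on a toy potential. The function name `langevin_samples`, the choice $U(x) = -x^2/2$ (whose invariant distribution is the standard Gaussian), and all hyper-parameter values are assumptions made for illustration; they are not the settings used in our experiments.

```python
import numpy as np


def langevin_samples(grad_U, y0, step_size, num_steps, rng):
    """Iterate Y_{k+1} = Y_k + a * grad_U(Y_k) + sqrt(2 a) * Z_{k+1} with a constant step a."""
    y = np.array(y0, dtype=float)
    for _ in range(num_steps):
        noise = rng.standard_normal(y.shape)
        y = y + step_size * grad_U(y) + np.sqrt(2.0 * step_size) * noise
    return y


# Toy potential (an assumption for illustration): U(x) = -x^2 / 2,
# so that p(x) proportional to exp(U(x)) is the standard Gaussian N(0, 1).
def grad_U(x):
    return -x


rng = np.random.default_rng(0)
# Run 5000 independent one-dimensional chains in parallel, all started far from the mode.
samples = langevin_samples(grad_U, y0=np.full(5000, 3.0),
                           step_size=0.01, num_steps=2000, rng=rng)
sample_mean = float(samples.mean())
sample_var = float(samples.var())
```

With a small constant step size the empirical mean and variance of the chains should be close to those of the target distribution, up to a small discretization bias. In our setting the gradient would be taken with respect to a single state, which is where the one-backpropagation-per-step cost discussed above comes from.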

A.3 PROOF FOR THEOREM 1

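Theorem 1. For a constant $c$ determined by $\theta, \mathcal{T}$, we have

$$\mathbb{E}_{(x,y)\sim uniform(\mathcal{T})}\left[\nabla_\theta \tfrac{1}{3}|f_\theta(x) - y|^3\right] = c\,\mathbb{E}_{(x,y)\sim q(x,y;\theta)}\left[\nabla_\theta \tfrac{1}{2}(f_\theta(x) - y)^2\right].$$

Proof. The proof is straightforward. The expected gradient of the uniform sampling method is

$$\mathbb{E}_{(x,y)\sim uniform(\mathcal{T})}\left[\nabla_\theta \tfrac{1}{3}|f_\theta(x) - y|^3\right] = \frac{1}{n}\sum_{i=1}^{n} |f_\theta(x_i) - y_i|\,\nabla_\theta \tfrac{1}{2}(f_\theta(x_i) - y_i)^2,$$

while the expected gradient of prioritized sampling is

$$\mathbb{E}_{(x,y)\sim q(x,y;\theta)}\left[\nabla_\theta \tfrac{1}{2}(f_\theta(x) - y)^2\right] = \sum_{i=1}^{n} q(x_i, y_i; \theta)\,\nabla_\theta \tfrac{1}{2}(f_\theta(x_i) - y_i)^2 = \frac{1}{\sum_{i=1}^{n} |f_\theta(x_i) - y_i|}\sum_{i=1}^{n} |f_\theta(x_i) - y_i|\,\nabla_\theta \tfrac{1}{2}(f_\theta(x_i) - y_i)^2 = \frac{n}{\sum_{i=1}^{n} |f_\theta(x_i) - y_i|}\,\mathbb{E}_{(x,y)\sim uniform(\mathcal{T})}\left[\nabla_\theta \tfrac{1}{3}|f_\theta(x) - y|^3\right].$$

Setting $c = \frac{1}{n}\sum_{i=1}^{n} |f_\theta(x_i) - y_i|$ completes the proof.

A.4 PROOF FOR THEOREM 2

Theorem 2. Consider the following two objectives: $\ell_2(x, y) \overset{def}{=} \frac{1}{2}(x - y)^2$ and $\ell_3(x, y) \overset{def}{=} \frac{1}{3}|x - y|^3$. Denote $\delta_t \overset{def}{=} |x_t - y|$ and $\tilde{\delta}_t \overset{def}{=} |\tilde{x}_t - y|$. Define the functional gradient flow updates on these two objectives:

$$\frac{dx_t}{dt} = -\eta\,\frac{d\{\frac{1}{2}(x_t - y)^2\}}{dx_t}, \qquad \frac{d\tilde{x}_t}{dt} = -\eta\,\frac{d\{\frac{1}{3}|\tilde{x}_t - y|^3\}}{d\tilde{x}_t}.$$

Given an error threshold $\epsilon \ge 0$, define the hitting times $t_\epsilon \overset{def}{=} \min_t \{t : \delta_t \le \epsilon\}$ and $\tilde{t}_\epsilon \overset{def}{=} \min_t \{t : \tilde{\delta}_t \le \epsilon\}$. For any initial function value $x_0$ s.t. $\delta_0 > 1$, $\exists\, \epsilon_0 \in (0, 1)$ such that $\forall \epsilon > \epsilon_0$, $t_\epsilon \ge \tilde{t}_\epsilon$.

Proof. For the gradient flow update on the $\ell_2$ objective, we have

$$\frac{d\ell_2(x_t, y)}{dt} = \frac{d\ell_2(x_t, y)}{d\delta_t} \cdot \frac{d\delta_t}{dx_t} \cdot \frac{dx_t}{dt} = \delta_t \cdot \mathrm{sgn}(x_t - y) \cdot [-\eta \cdot (x_t - y)] = \delta_t \cdot \mathrm{sgn}(x_t - y) \cdot [-\eta \cdot \mathrm{sgn}(x_t - y) \cdot \delta_t] = -\eta \cdot \delta_t^2 = -2 \cdot \eta \cdot \ell_2(x_t, y),$$

which implies

$$\frac{d\{\ln \ell_2(x_t, y)\}}{dt} = \frac{1}{\ell_2(x_t, y)} \cdot \frac{d\ell_2(x_t, y)}{dt} = -2 \cdot \eta.$$

Taking the integral, we have

$$\ln \ell_2(x_t, y) - \ln \ell_2(x_0, y) = -2 \cdot \eta \cdot t,$$

which is equivalent to (letting $\delta_t = \epsilon$)

$$t_\epsilon \overset{def}{=} \frac{1}{2\eta} \cdot \ln \frac{\ell_2(x_0, y)}{\ell_2(x_t, y)} = \frac{1}{\eta} \cdot \ln \frac{\delta_0}{\delta_t} = \frac{1}{\eta} \cdot \ln \frac{\delta_0}{\epsilon}.$$

Similarly, for the gradient flow update on the $\ell_3$ objective, we have

$$\frac{d\ell_3(\tilde{x}_t, y)}{dt} = \frac{d\ell_3(\tilde{x}_t, y)}{d\tilde{\delta}_t} \cdot \frac{d\tilde{\delta}_t}{d\tilde{x}_t} \cdot \frac{d\tilde{x}_t}{dt} = \tilde{\delta}_t^2 \cdot \mathrm{sgn}(\tilde{x}_t - y) \cdot [-\eta \cdot \tilde{\delta}_t^2 \cdot \mathrm{sgn}(\tilde{x}_t - y)] = -\eta \cdot \tilde{\delta}_t^4 = -3^{\frac{4}{3}} \cdot \eta \cdot (\ell_3(\tilde{x}_t, y))^{\frac{4}{3}},$$

which implies

$$\frac{d\{(\ell_3(\tilde{x}_t, y))^{-\frac{1}{3}}\}}{dt} = -\frac{1}{3} \cdot (\ell_3(\tilde{x}_t, y))^{-\frac{4}{3}} \cdot \frac{d\ell_3(\tilde{x}_t, y)}{dt} = 3^{\frac{1}{3}} \cdot \eta.$$

Taking the integral, we have

$$(\ell_3(\tilde{x}_t, y))^{-\frac{1}{3}} - (\ell_3(\tilde{x}_0, y))^{-\frac{1}{3}} = 3^{\frac{1}{3}} \cdot \eta \cdot t,$$

which is equivalent to (letting $\tilde{\delta}_t = \epsilon$)

$$\tilde{t}_\epsilon \overset{def}{=} \frac{1}{3^{\frac{1}{3}} \cdot \eta} \cdot \left[(\ell_3(\tilde{x}_t, y))^{-\frac{1}{3}} - (\ell_3(\tilde{x}_0, y))^{-\frac{1}{3}}\right] = \frac{1}{\eta} \cdot \left(\frac{1}{\tilde{\delta}_t} - \frac{1}{\delta_0}\right) = \frac{1}{\eta} \cdot \left(\frac{1}{\epsilon} - \frac{1}{\delta_0}\right).$$

Then we have

$$t_\epsilon - \tilde{t}_\epsilon = \frac{1}{\eta} \cdot \ln \frac{\delta_0}{\epsilon} - \frac{1}{\eta} \cdot \left(\frac{1}{\epsilon} - \frac{1}{\delta_0}\right) = \frac{1}{\eta} \cdot \left[\left(\ln \frac{1}{\epsilon} - \frac{1}{\epsilon}\right) - \left(\ln \frac{1}{\delta_0} - \frac{1}{\delta_0}\right)\right].$$

Define the function $f(x) = \ln \frac{1}{x} - \frac{1}{x}$, $x > 0$. It is continuous and $\max_{x > 0} f(x) = f(1) = -1$. We have $\lim_{x \to 0} f(x) = \lim_{x \to \infty} f(x) = -\infty$, and $f(\cdot)$ is monotonically increasing for $x \in (0, 1]$ and monotonically decreasing for $x \in (1, \infty)$. Given $\delta_0 > 1$, we have $f(\delta_0) < f(1) = -1$. Using the intermediate value theorem for $f(\cdot)$ on $(0, 1]$, there exists $\epsilon_0 < 1$ such that $f(\epsilon_0) = f(\delta_0)$. Since $f(\cdot)$ is monotonically increasing on $(0, 1]$ and monotonically decreasing on $(1, \infty)$, for any $\epsilon \in [\epsilon_0, \delta_0]$ we have $f(\epsilon) \ge f(\delta_0)$ (note that $\epsilon \le \delta_0$ by the design of the gradient descent updating rule). Hence we have

$$t_\epsilon - \tilde{t}_\epsilon = \frac{1}{\eta} \cdot [f(\epsilon) - f(\delta_0)] \ge 0.$$

Remark 1. Figure 8 shows the function $f(x) = \ln \frac{1}{x} - \frac{1}{x}$, $x > 0$. For an arbitrary fixed $x' > 1$, there is another root $\epsilon_0 < 1$ s.t. $f(\epsilon_0) = f(x')$. However, there is no real-valued closed-form solution for $\epsilon_0$. The solution in $\mathbb{C}$ is $\epsilon_0 = -\frac{1}{W(\log 1/\delta_0 - 1/\delta_0 - \pi i)}$, where $W(\cdot)$ is a Wright Omega function. Hence, finding the exact value of $\epsilon_0$ would require a definition of ordering on the complex plane. Our current theorem statement is sufficient for the purpose of characterizing the convergence rate. The theorem states that there always exists some desired low error level $\epsilon < 1$ for which minimizing the square loss converges more slowly than minimizing the cubic loss.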
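The closed-form hitting times derived above can be compared numerically. The sketch below evaluates the square-loss and cubic-loss hitting times and locates $\epsilon_0$ by bisection on $(0, 1]$. The function names, the values $\eta = 1$ and $\delta_0 = 2$, and the bisection tolerance are illustrative assumptions.

```python
import math


def hitting_time_square(delta0, eps, eta=1.0):
    """(1/eta) * ln(delta0 / eps): hitting time of the square-loss gradient flow."""
    return math.log(delta0 / eps) / eta


def hitting_time_cubic(delta0, eps, eta=1.0):
    """(1/eta) * (1/eps - 1/delta0): hitting time of the cubic-loss gradient flow."""
    return (1.0 / eps - 1.0 / delta0) / eta


def f(x):
    """f(x) = ln(1/x) - 1/x, increasing on (0, 1] and decreasing on (1, inf)."""
    return math.log(1.0 / x) - 1.0 / x


def find_eps0(delta0, tol=1e-12):
    """Bisection on (0, 1] for the root of f(eps) = f(delta0), assuming delta0 > 1."""
    target = f(delta0)
    lo, hi = 1e-12, 1.0  # f(lo) < target < f(hi) = -1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


delta0 = 2.0
eps0 = find_eps0(delta0)
# For thresholds between eps0 and delta0, the square loss takes at least as long.
gaps = [hitting_time_square(delta0, e) - hitting_time_cubic(delta0, e)
        for e in (eps0 + 1e-6, 0.8, 1.0, 1.5, delta0)]
```

Thresholds well below `eps0` show the opposite ordering, which is why the theorem is stated only for $\epsilon > \epsilon_0$.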

A.5 PROOF FOR THEOREM 4

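We now provide the error bound for Theorem 4. We denote the transition probability distribution under policy $\pi$ with the true model as $P^\pi(r, s'|s)$, and that with the learned model as $\hat{P}^\pi(r, s'|s)$. Let $p(s)$ and $\hat{p}(s)$ be the convergent distributions described in Theorem 3 obtained by using the true model and the learned model, respectively. Let $d_{tv}(\cdot, \cdot)$ be the total variation distance between two probability distributions. Define $u(s) \overset{def}{=} |\delta(s, y(s))|$, $\hat{u}(s) \overset{def}{=} |\delta(s, \hat{y}(s))|$, $Z \overset{def}{=} \int_{s \in S} u(s)\,ds$, $\hat{Z} \overset{def}{=} \int_{s \in S} \hat{u}(s)\,ds$. Then we have the following bound.

$$|u(s) - \hat{u}(s)| \le \dots \le \left(R_{max} + \frac{\gamma R_{max}}{1 - \gamma}\right) \int_{s', r} \left|P^\pi(s', r|s) - \hat{P}^\pi(s', r|s)\right| ds'\,dr \le V_{max}\, d_{tv}(P^\pi(\cdot|s), \hat{P}^\pi(\cdot|s)) \le V_{max}\,\epsilon_s.$$

Now, we show that $|Z - \hat{Z}| \le V_{max}\,\epsilon$:

$$|Z - \hat{Z}| = \left|\int_{s \in S} u(s)\,ds - \int_{s \in S} \hat{u}(s)\,ds\right| = \left|\int_{s \in S} (u(s) - \hat{u}(s))\,ds\right| \le \int_{s \in S} |u(s) - \hat{u}(s)|\,ds \le V_{max} \int_{s \in S} \epsilon_s\,ds = V_{max}\,\epsilon.$$

Consider the case $p(s) > \hat{p}(s)$ first:

$$p(s) - \hat{p}(s) = \frac{u(s)}{Z} - \frac{\hat{u}(s)}{\hat{Z}} \le \frac{u(s)}{Z} - \frac{u(s) - V_{max}\,\epsilon_s}{\hat{Z}} = \frac{u(s)\hat{Z} - u(s)Z + Z V_{max}\,\epsilon_s}{Z\hat{Z}} \le \frac{u(s) V_{max}\,\epsilon + Z V_{max}\,\epsilon_s}{Z\hat{Z}} = \frac{V_{max}(p(s)\,\epsilon + \epsilon_s)}{\hat{Z}}.$$

Meanwhile, the following inequality should also hold:

$$p(s) - \hat{p}(s) = \frac{u(s)}{Z} - \frac{\hat{u}(s)}{\hat{Z}} \le \frac{\hat{u}(s) + V_{max}\,\epsilon_s}{Z} - \frac{\hat{u}(s)}{\hat{Z}} = \frac{\hat{u}(s)\hat{Z} - \hat{u}(s)Z + \hat{Z} V_{max}\,\epsilon_s}{Z\hat{Z}} \le \frac{V_{max}(\hat{p}(s)\,\epsilon + \epsilon_s)}{Z}.$$

Because both inequalities must hold, when $p(s) - \hat{p}(s) > 0$ we have

$$p(s) - \hat{p}(s) \le \min\left(\frac{V_{max}(p(s)\,\epsilon + \epsilon_s)}{\hat{Z}}, \frac{V_{max}(\hat{p}(s)\,\epsilon + \epsilon_s)}{Z}\right).$$

It turns out that the bound is the same when $p(s) \le \hat{p}(s)$. This completes the proof.

Remark. This bound indicates that $|p(s) - \hat{p}(s)|$ should be small. If $p(s)$ is much larger than $\hat{p}(s)$, then we may expect the second term in the min function to be chosen. During early learning, although the model error can be large, $\hat{Z}$ and $Z$ should also be very large. The total model error $\epsilon$ is scaled by $p(s)$ or $\hat{p}(s)$, so its contribution should be small. We may expect a good approximation even when the model is not perfectly learned.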
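As a sanity check of the inequality chain above, the following sketch replaces the integrals over $S$ by sums over a finite set of states and verifies the bound on a random instance. The discretization, the function name, the value $V_{max} = 1$, and the way the per-state errors $\epsilon_s$ are generated are all assumptions made for illustration.

```python
import numpy as np


def priority_gap_bound(u, u_hat, eps_s, v_max):
    """Return |p - p_hat| and min(V(p*eps + eps_s)/Z_hat, V(p_hat*eps + eps_s)/Z)."""
    Z, Z_hat = u.sum(), u_hat.sum()
    p, p_hat = u / Z, u_hat / Z_hat
    eps = eps_s.sum()  # discrete analogue of the integral of eps_s over S
    bound = np.minimum(v_max * (p * eps + eps_s) / Z_hat,
                       v_max * (p_hat * eps + eps_s) / Z)
    return np.abs(p - p_hat), bound


rng = np.random.default_rng(0)
v_max = 1.0
u = rng.uniform(0.1, 1.0, size=100)       # priorities under the true model
eps_s = rng.uniform(0.0, 0.05, size=100)  # per-state model error
# Perturb each priority by at most V_max * eps_s, so |u - u_hat| <= V_max * eps_s holds.
u_hat = u + v_max * eps_s * rng.uniform(-1.0, 1.0, size=100)
gap, bound = priority_gap_bound(u, u_hat, eps_s, v_max)
```

On this instance every entry of `gap` should lie below the corresponding entry of `bound`, in line with the statement that the two sampling distributions stay close when the per-state model error is small.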

A.6 REPRODUCIBLE RESEARCH

Our implementations are based on TensorFlow version 1.13.0 (Abadi et al., 2015). We use the Adam optimizer (Kingma & Ba, 2014) for all experiments.

A.6.1 REPRODUCE EXPERIMENTS BEFORE SECTION 5

Supervised learning experiment. For the supervised learning experiment shown in Section 3, we use a 32 × 32 tanh units neural network, with the learning rate swept over {0.01, 0.001, 0.0001, 0.00001} for all algorithms. We compute the constant c as specified in Theorem 1 at each time step for the Cubic loss. We compute the testing error every 500 iterations (mini-batch updates), and our evaluation learning curves are plotted by averaging over 50 random seeds. For each random seed, we randomly split the dataset into a testing set and a training set, where the testing set has 1k data points. Note that the testing set is not noise-contaminated.

Reinforcement learning experiments in Section 3. We use a particularly small 16 × 16 neural network to highlight the issue of incomplete priority updating. Intuitively, a large neural network may be able to memorize each state's value, and thus updating one state's value is less likely to affect others. We choose a small neural network, in which case complete priority updating for all states should be very important. We set the maximum ER buffer size to 10k and the mini-batch size to 32. The learning rate is 0.001 and the target network is updated every 1k steps.

Distribution distance computation in Section 4. We now introduce the implementation details for Figure 3. The distance is estimated by the following steps. First, in order to compute the desired sampling distribution, we discretize the domain into a 50 × 50 grid and calculate the absolute TD error of each grid cell (represented by its bottom-left vertex coordinates) by using the true environment model and the current learned Q function. We then normalize these priorities to get the probability distribution $p^*$. Note that this distribution is considered the desired one, since we have access to all states across the state space with priorities computed by the current Q-function at each time step. Second, we estimate our sampling distribution by randomly sampling 3k states from the search-control queue, counting the number of states falling into each grid cell, and normalizing these counts to get $p_1$. Third, for comparison, we estimate the sampling distribution of the conventional prioritized ER (Schaul et al., 2016) by sampling 3k states from the prioritized ER buffer, counting the states falling into each grid cell, and normalizing the counts to get the corresponding distribution $p_2$. Then we compute the distances of $p_1, p_2$ to $p^*$ under two weighting schemes: 1) on-policy weighting: $\sum_{j=1}^{2500} d^\pi(s_j)\,|p_i(s_j) - p^*(s_j)|$, $i \in \{1, 2\}$, where $d^\pi$ is approximated by uniformly sampling 3k states from a recency buffer and normalizing their visitation counts on the discretized GridWorld; 2) uniform weighting: $\frac{1}{2500}\sum_{j=1}^{2500} |p_i(s_j) - p^*(s_j)|$, $i \in \{1, 2\}$. We examine the two weighting schemes because of two considerations: for the on-policy weighting, we are concerned with the asymptotic convergent



We show the testing error as it is the primary concern. The training error has similar comparative performance and is presented in Appendix 3, where we also include additional results with different settings.

The stepsize and variance affect the temperature parameter. We avoid introducing too much notation here and simply treat the two as hyper-parameters in the implementation. We fix one setting across experiments.

Note that $\epsilon < \delta_0$ by the design of using the gradient descent updating rule. If the two are equal, $t_\epsilon = \tilde{t}_\epsilon = 0$ holds trivially.

Note that this is one of its disadvantages: the search-control of Dyna-Frequency requires the computation of a Hessian-gradient product, and it is empirically observed that the Hessian is frequently zero when using ReLU hidden units (Pan et al., 2020).




Figure 1: (a) shows the cubic vs. square functions. (b) shows their absolute derivatives. (c) shows the hitting time ratio vs. the initial value $x_0$ under different target values $x_t$. (d) shows the ratio vs. the target $x_t$ to reach under different $x_0$. Note that a ratio larger than 1 indicates a longer time to reach the given $x_t$ for the square loss.

Figure 2: Testing RMSE vs. number of mini-batch updates. (a)(b)(c)(d) show the learning curves with different mini-batch sizes b or Gaussian noise variances σ added to the training targets. (a) uses σ = 0.1 and a smaller training set (solid line for |T| = 800, dotted line for |T| = 1600) than the others but has the same testing set size. (e) shows a corresponding experiment in the RL setting on the classical Mountain Car domain. The results are averaged over 50 random seeds on (a)-(d) and 30 on (e). The shade indicates standard error.

Figure 3: (a) shows the GridWorld taken from Pan et al. (2019). The state space is $S = [0, 1]^2$, and the agent starts from the bottom left and should learn to take actions from A = {up, down, right, left} to reach the top right within as few steps as possible. (b) shows the distance change as a function of training steps. The dashed line corresponds to our algorithm with an online learned model. The corresponding evaluation learning curve is in Figure 5(c). (d) shows the policy evaluation performance as a function of running time (seconds). All results are averaged over 20 random seeds and the shade indicates standard error.

Figure 4: (a) and (b) show the ER buffer state distributions trained by regular ER and prioritized ER, respectively. (c) shows the search-control queue state distribution of our Dyna-TD algorithm after training for the same number of environment time steps. It can be seen that our algorithm has a much broader coverage of the sample space.

Figure 5: Evaluation learning curves on benchmark domains with planning updates n = 10, 30. The dashed line denotes Dyna-TD with an online learned model. All results are averaged over 20 random seeds. (g) shows the MazeGridWorld (GW) taken from Pan et al. (2020) and the learning curves are in (h). On MazeGW, we do not show model-free baselines as it is reported that model-free baselines do significantly worse than Dyna variants (Pan et al., 2020). We do reproduce the result of Dyna-Frequency from that paper.

Figure 6: Evaluation learning curves on Mountain Car with different number of planning updates and different reward noise variance. At each time step, the reward is sampled from the Gaussian N (-1, σ). σ = 0 indicates deterministic reward. All results are averaged over 20 random seeds.

Figure 7: (a) shows the roundabout domain, where S ⊂ R 90 . (b) shows the corresponding evaluation learning curves in terms of number of car crashes as a function of driving time steps. The results are averaged over 50 random seeds. The shade indicates standard error.

Tabular Dyna
Initialize Q(s, a); initialize model M(s, a), ∀(s, a) ∈ S × A
while true do
    observe s, take action a by ε-greedy w.r.t. Q(s, ·)
    execute a, observe reward R and next state s'
    Q-learning update for Q(s, a)
    update model M(s, a) (i.e. by counting)
    store (s, a) into search-control queue
    for i = 1 : d do
        sample (s̃, ã) from search-control queue
        (s̃', R̃) ← M(s̃, ã) // simulated transition
        Q-learning update for Q(s̃, ã) // planning update

A.2 DISCUSSION ON THE LANGEVIN DYNAMICS

Define an SDE:

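Consider the following two objectives: $\ell_2(x, y) \overset{def}{=} \frac{1}{2}(x - y)^2$ and $\ell_3(x, y) \overset{def}{=} \frac{1}{3}|x - y|^3$. Denote $\delta_t \overset{def}{=} |x_t - y|$ and $\tilde{\delta}_t \overset{def}{=} |\tilde{x}_t - y|$. Define the functional gradient flow updates on these two objectives: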

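$u(s) \overset{def}{=} |\delta(s, y(s))|$, $\hat{u}(s) \overset{def}{=} |\delta(s, \hat{y}(s))|$, $Z \overset{def}{=} \int_{s \in S} u(s)\,ds$, $\hat{Z} \overset{def}{=} \int_{s \in S} \hat{u}(s)\,ds$. Then we have the following bound.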

Figure 8: The function $f(x) = \ln \frac{1}{x} - \frac{1}{x}$, $x > 0$. The function reaches its maximum at $x = 1$.

Figure 10: (a)(b)(c) show the testing RMSE as a function of the number of mini-batch updates with different mini-batch sizes or Gaussian noise with different σ added to the training targets. (d)(e)(f) show the training RMSE. The results are averaged over 50 random seeds. The standard error is small enough to be ignored. Note that the target variable in the testing set is not noise-contaminated.

Figure 11: Figure(a)(b)(c) show the training RMSE as a function of number of mini-batch updates on the Bike sharing dataset. The results are averaged over 20 random seeds. The shade indicates standard error.


behavior and want to down-weight those states with relatively high TD error that are rarely visited as the policy gets close to optimal; uniform weighting makes more sense during the early learning stage, where we consider all states equally important and want the agent to sufficiently explore the whole state space.
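The distance estimation procedure described above can be sketched as follows. This is a minimal illustration rather than our actual implementation: the function names are hypothetical, and the sampled states and absolute TD errors are random placeholders standing in for the search-control queue, the recency buffer, and the TD errors computed with the true model.

```python
import numpy as np


def grid_distribution(states, n_bins=50):
    """Histogram states in [0, 1]^2 on an n_bins x n_bins grid and normalize the counts."""
    idx = np.clip((np.asarray(states) * n_bins).astype(int), 0, n_bins - 1)
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (idx[:, 0], idx[:, 1]), 1.0)
    return (counts / counts.sum()).ravel()


def weighted_l1(p, p_star, weights):
    """sum_j w_j * |p(s_j) - p*(s_j)| over all grid cells."""
    return float(np.sum(weights * np.abs(p - p_star)))


rng = np.random.default_rng(0)
n_cells = 50 * 50
# Placeholder data (assumptions): in the actual procedure these come from the
# search-control queue, the recency buffer, and the true-model TD errors.
queue_states = rng.uniform(0.0, 1.0, size=(3000, 2))
recent_states = rng.uniform(0.0, 1.0, size=(3000, 2))
abs_td_error = rng.uniform(0.0, 1.0, size=n_cells)  # same cell order as ravel() above

p_star = abs_td_error / abs_td_error.sum()  # desired distribution
p_1 = grid_distribution(queue_states)       # estimated sampling distribution
d_pi = grid_distribution(recent_states)     # on-policy weighting
on_policy_dist = weighted_l1(p_1, p_star, d_pi)
uniform_dist = weighted_l1(p_1, p_star, np.full(n_cells, 1.0 / n_cells))
```

The same two calls to `weighted_l1` would be repeated with the distribution estimated from the prioritized ER buffer in place of `p_1` to obtain the comparison curves.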

A.6.2 REPRODUCE EXPERIMENTS IN SECTION 5

For our algorithm, the pseudo-code with concrete parameter settings is presented in Algorithm 4.

Common settings. For all domains other than roundabout-v0, we use a 32 × 32 neural network with ReLU hidden units, except for Dyna-Frequency, which uses tanh units as suggested by the authors (Pan et al., 2020).⁵ Except for the output layer parameters, which are initialized from a uniform distribution on [-0.003, 0.003], all parameters are initialized using Xavier initialization (Glorot & Bengio, 2010). We use mini-batch size b = 32 and maximum ER buffer size 50k. All algorithms use target network moving frequency 1000, and we sweep the learning rate over {0.001, 0.0001}. We use 5000 warm-up steps (i.e. random actions are taken in the first 5k time steps) to populate the ER buffer before learning starts. We keep the exploration noise at 0.1 without decaying.

Meta-parameter settings. For our algorithm Dyna-TD, we are able to keep the same parameter setting across all benchmark domains: $\alpha_a$ = 0.1, c = 20 and learning rate 0.001. For all Dyna variants, we fetch the same number of states (m = 20) from hill climbing (i.e. the search-control process) as Dyna-TD does, and use a = 0.1 and set the maximum number of gradient steps to k = 100 unless otherwise specified.

Our prioritized ER is implemented as the proportional version with a sum tree data structure. To ensure a fair comparison, since all model-based methods use mixed mini-batches of samples, we use prioritized ER without importance ratios, but half of the mini-batch samples are uniformly sampled from the ER buffer as a strategy for bias correction. For Dyna-Value and Dyna-Frequency, we use the settings described in the original papers.

For the purpose of learning an environment model, we use a 64 × 64 ReLU units neural network to predict $s' - s$ and the reward given a state-action pair s, a; we use mini-batch size 128 and learning rate 0.0001 to minimize the mean squared error objective for training the environment model.

Environment-specific settings. All of the environments are from OpenAI (Brockman et al., 2016) except that: 1) the GridWorld environment is taken from Pan et al. (2019) and the MazeGridWorld is from Pan et al. (2020); 2) roundabout-v0 is from Leurent et al. (2019). For all OpenAI environments, we use the default settings except on Mountain Car, where we set the episode length limit to 2k. The GridWorld has state space $S = [0, 1]^2$; each episode starts from the bottom left, and the goal area is at the top right, $[0.95, 1.0]^2$. There is a wall in the middle with a hole to allow the agent to pass. MazeGridWorld is a more complicated version where the state and action spaces are the same as in GridWorld, but there are two walls in the middle and it takes a long time for model-free methods to be successful. On this domain, we use the same settings as the original paper for all Dyna variants. We use exactly the same settings as described above, except that we change the Q-network size to 64 × 64 ReLU units and the number of search-control samples to m = 50, as used by the original paper. We refer readers to the original paper (Pan et al., 2020) for more details.

On the roundabout-v0 domain, we use 64 × 64 ReLU units for all algorithms and set the mini-batch size to 64. For Dyna-TD, we start using the model after 5k steps and set m = 100, $\alpha_a$ = 1.0, k = 500, and we do search-control every 10 environment time steps to reduce computational cost. To alleviate the effect of model error, we use only 16 out of 64 samples from the search-control queue in a mini-batch.
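The mixed mini-batch used for the prioritized ER baseline above (half drawn proportionally to priority, half drawn uniformly) can be sketched as follows. For brevity, the sketch samples with NumPy's `Generator.choice` instead of a sum tree, which is assumed to give the same sampling distribution at a higher per-draw cost; the function name and the toy priorities are illustrative assumptions.

```python
import numpy as np


def sample_mixed_batch(priorities, batch_size, rng):
    """Return buffer indices: half drawn proportionally to priority, half uniformly."""
    priorities = np.asarray(priorities, dtype=float)
    n = len(priorities)
    n_prioritized = batch_size // 2
    probs = priorities / priorities.sum()
    prioritized_idx = rng.choice(n, size=n_prioritized, p=probs)
    uniform_idx = rng.choice(n, size=batch_size - n_prioritized)
    return np.concatenate([prioritized_idx, uniform_idx])


rng = np.random.default_rng(0)
# Toy priorities (an assumption), standing in for absolute TD errors in the buffer.
toy_priorities = np.abs(rng.standard_normal(1000)) + 1e-6
batch = sample_mixed_batch(toy_priorities, batch_size=32, rng=rng)
```

The uniform half plays the role of the bias correction mentioned above, in place of importance ratios.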

A.7 ADDITIONAL EXPERIMENTS

The main purpose of the additional experiments here is to strengthen our claims from Section 3.

Learning curve in terms of training error corresponding to Figure 2. In Section 3.1, we show the learning curve in terms of testing error. We now show the training error to closely match our theoretical result, Theorem 1. As a supplement, we include the learning curve in terms of testing error in Figure 9.

Learning curve with a larger neural network on the sin dataset. We try to eliminate the effect of neural network size. Hence we use a larger neural network (128 × 128 tanh units) on the same sin dataset. As one can see from Figure 10, FullPrioritizedL2 still performs the best, and when we increase the mini-batch size from 128 to 512, the high-power objective versions still move closer to FullPrioritizedL2, as we saw in Figure 2.

Learning curve on a real-world dataset. To illustrate the generality of our Theorem 1, we also conduct tests on a frequently cited regression dataset, Bike sharing (Fanaee-T & Gama, 2013). The data preprocessing is as follows. We remove the attributes: date, index, year, weather situation 4, weekday 7, registered, casual. We use one-hot encoding for all categorical variables. We scale the target to [0, 1] and scale it back when computing training errors. We use a 64 × 64 ReLU units neural network with mini-batch size 128 and learning rate 0.0001 for training.

It should be noted that the behavior of Cubic is consistent between the previous sin example and this real-world dataset: it gets closer to FullPrioritizedL2 as we increase the mini-batch size. Another observation is that the Power4 objective has high variance on this domain, likely because real-world data are noisy and high-order objectives suffer from noise. This observation corresponds to what we observed in Section 3.1, where we show that high-power objectives are sensitive to noise.
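The gradient identity of Theorem 1 that these experiments probe can also be checked numerically on a toy problem. The sketch below uses a linear model $f_\theta(x) = \theta^\top x$ on random data, both of which are assumptions made purely for illustration, and compares the uniformly sampled cubic-loss gradient with $c$ times the priority-weighted square-loss gradient over the full training set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta = rng.standard_normal(d)

residual = X @ theta - y  # f_theta(x_i) - y_i for the linear model
abs_res = np.abs(residual)

# Uniform sampling on the cubic loss: mean_i grad (1/3)|r_i|^3 = mean_i |r_i| r_i x_i.
grad_cubic_uniform = (abs_res * residual) @ X / n

# Prioritized sampling on the square loss: sum_i q_i grad (1/2) r_i^2, q_i proportional to |r_i|.
q = abs_res / abs_res.sum()
grad_square_prioritized = (q * residual) @ X

c = abs_res.mean()  # the constant c of Theorem 1
max_diff = float(np.max(np.abs(grad_cubic_uniform - c * grad_square_prioritized)))
```

The two gradients agree up to floating-point error when the expectation is taken over the whole training set; with mini-batches the match is only approximate, which is consistent with Cubic moving closer to FullPrioritizedL2 as the mini-batch size grows.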

