REVISITING PRIORITIZED EXPERIENCE REPLAY: A VALUE PERSPECTIVE

Abstract

Reinforcement learning (RL) agents need to learn from past experiences. Prioritized experience replay, which weighs experiences by their surprise (the magnitude of the temporal-difference error), significantly improves the learning efficiency of RL algorithms. Intuitively, surprise quantifies the unexpectedness of an experience to the learning agent. But how surprise is related to the importance of experience is not well understood. To address this problem, we derive three value metrics that quantify the importance of experience by the extra reward that would be earned by accessing the experience. We theoretically show that these value metrics are upper-bounded by surprise for Q-learning. Furthermore, we extend our theoretical framework to maximum-entropy RL by deriving the lower and upper bounds of these value metrics for soft Q-learning, which turn out to be products of surprise and the "on-policyness" of the experiences. Our framework links two important quantities in RL, i.e., surprise and the value of experience, and provides a theoretical basis for estimating the value of experience by surprise. We empirically show that the bounds hold in practice, and that experience replay using the upper bound as priority improves maximum-entropy RL in Atari games.

1. INTRODUCTION

Learning from important experiences prevails in nature. In the rodent hippocampus, memories with higher importance, such as those associated with rewarding locations or large reward-prediction errors, are replayed more frequently (Michon et al., 2019; Roscow et al., 2019; Salvetti et al., 2014). People who replay high-reward-associated memories more frequently show better performance in memory tasks (Gruber et al., 2016; Schapiro et al., 2018). A normative theory suggests that prioritized memory access according to the utility of memory explains hippocampal replay across different memory tasks (Mattar & Daw, 2018). As accumulating new experiences is costly, utilizing valuable past experiences is key to efficient learning (Ólafsdóttir et al., 2018). Differentiating important experiences from unimportant ones also benefits reinforcement learning (RL) algorithms (Katharopoulos & Fleuret, 2018). Prioritized experience replay (PER) (Schaul et al., 2016) is an experience replay technique built on the deep Q-network (DQN) (Mnih et al., 2015), which weighs the importance of samples by their surprise, the magnitude of the temporal-difference (TD) error. As a result, experiences with larger surprise are sampled more frequently. PER significantly improves the learning efficiency of DQN, and has been adopted (Hessel et al., 2018; Horgan et al., 2018; Kapturowski et al., 2019) and extended (Daley & Amato, 2019; Pan et al., 2018; Schlegel et al., 2019) by various deep RL algorithms. Surprise quantifies the unexpectedness of an experience to a learning agent, and biologically corresponds to the reward prediction error signal in the dopaminergic system (Schultz et al., 1997; Glimcher, 2011), which directly shapes memory in animals and humans (Lisman & Grace, 2005; McNamara et al., 2014). However, how surprise is related to the importance of experience in the context of RL is not well understood.
We address this problem from an economic perspective, by linking surprise to the value of experience in RL. The goal of an RL agent is to maximize the expected cumulative reward, which is achieved through learning from experiences. For Q-learning, an update on an experience leads to a more accurate prediction of the action-value or a better policy, which increases the expected cumulative reward the agent may obtain. We define the value of experience as the increase in the expected cumulative reward resulting from updating on the experience (Mattar & Daw, 2018). The value of experience quantifies the importance of experience from first principles: assuming that the agent is economically rational and has full information about the value of experience, it will choose the most valuable experience to update on, which yields the highest utility. As complements, we derive two further value metrics, corresponding to the evaluation improvement value and the policy improvement value of an update on an experience. In this work, we mathematically show that these value metrics are upper-bounded by surprise for Q-learning. Therefore, surprise implicitly tracks the value of experience, and accounts for the importance of experience. We further extend our framework to maximum-entropy RL, which augments the reward with an entropy term to encourage exploration (Haarnoja et al., 2017). We derive the lower and upper bounds of these value metrics for soft Q-learning, which are related to the surprise and the "on-policyness" of the experience. Experiments in Maze and CartPole support our theoretical results for both tabular and function approximation RL methods, showing that the derived bounds hold in practice. Moreover, we show that experience replay using the upper bound as priority improves maximum-entropy RL (i.e., soft DQN) in Atari games.

2.1. Q-LEARNING AND EXPERIENCE REPLAY

We consider a Markov Decision Process (MDP) defined by a tuple {S, A, P, R, γ}, where S is a finite set of states, A is a finite set of actions, P is the transition function, R is the reward function, and γ ∈ [0, 1] is the discount factor. A policy π of an agent assigns probability π(a|s) to each action a ∈ A given state s ∈ S. The goal is to learn an optimal policy that maximizes the expected discounted return starting from time step t, G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0}^∞ γ^i r_{t+i}, where r_t is the reward the agent receives at time step t. The value function v_π(s) is defined as the expected return starting from state s and following policy π, and the Q-function q_π(s, a) is the expected return of performing action a in state s and subsequently following policy π. According to Q-learning (Watkins & Dayan, 1992), the optimal policy can be learned through policy iteration: performing policy evaluation and policy improvement alternately and iteratively. For each policy evaluation, we update Q(s, a), an estimate of q_π(s, a), by Q_new(s, a) = Q_old(s, a) + α TD(s, a, r, s'), where the TD error is TD(s, a, r, s') = r + γ max_{a'} Q_old(s', a') − Q_old(s, a) and α is the step-size parameter. Q_old and Q_new denote the estimated Q-function before and after the update, respectively. For each policy improvement, we update the policy from π_old to π_new greedily with respect to the newly estimated Q-function, π_new(s) = arg max_a Q_new(s, a). Standard Q-learning uses each experience only once before discarding it, which is sample-inefficient and can be improved by the experience replay technique (Lin, 1992). We denote the experience that the agent collects at time k by a 4-tuple e_k = {s_k, a_k, r_k, s'_k}. With experience replay, the experience e_k is stored in a replay buffer and can be accessed multiple times during learning.
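As a concrete illustration, the tabular update rule above can be written in a few lines of Python (a minimal sketch; the 3-state, 2-action table and the hyperparameter values are arbitrary placeholders):

```python
import numpy as np

def q_update(Q, experience, alpha=0.5, gamma=0.9):
    """One Q-learning backup on experience e_k = (s, a, r, s_next).

    Computes TD(s, a, r, s') = r + gamma * max_a' Q(s', a') - Q(s, a)
    and applies Q(s, a) <- Q(s, a) + alpha * TD.
    """
    s, a, r, s_next = experience
    td = r + gamma * Q[s_next].max() - Q[s, a]
    Q_new = Q.copy()
    Q_new[s, a] += alpha * td
    return Q_new, td

# A replay buffer is simply a list of such 4-tuples, each of which
# can be sampled and replayed multiple times during learning.
Q = np.zeros((3, 2))               # 3 states, 2 actions
buffer = [(0, 1, 1.0, 2)]          # e_k = (s_k, a_k, r_k, s'_k)
Q, td = q_update(Q, buffer[0])     # td = 1.0, so Q[0, 1] becomes 0.5
```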

2.2. VALUE METRICS OF EXPERIENCE

To quantify the importance of experience, we derive three value metrics of experience. The utility of an update on experience e_k is defined as the value added to the cumulative discounted reward starting from state s_k after updating on e_k. Intuitively, choosing the most valuable experience for an update will yield the highest utility to the agent. We denote this utility as the expected value of backup, EVB(e_k) (Mattar & Daw, 2018):

EVB(e_k) = v_{π_new}(s_k) − v_{π_old}(s_k) = Σ_a π_new(a|s_k) q_{π_new}(s_k, a) − Σ_a π_old(a|s_k) q_{π_old}(s_k, a), (1)

where π_old, v_{π_old} and q_{π_old} are respectively the policy, value function and Q-function before the update, and π_new, v_{π_new} and q_{π_new} are those after. As the update on experience e_k consists of policy evaluation and policy improvement, the value of experience can further be separated into the evaluation improvement value EIV(e_k) and the policy improvement value PIV(e_k) by rewriting (1):

EVB(e_k) = Σ_a {π_new(a|s_k) − π_old(a|s_k)} q_{π_new}(s_k, a) + Σ_a π_old(a|s_k) {q_{π_new}(s_k, a) − q_{π_old}(s_k, a)} = PIV(e_k) + EIV(e_k), (2)

where PIV(e_k) measures the value improvement due to the change of the policy, and EIV(e_k) captures that due to the change of the evaluation. Thus, we have three metrics for the value of experience: EVB, PIV and EIV.

2.3. VALUE METRICS OF EXPERIENCE IN Q-LEARNING

For Q-learning, we use the Q-function to estimate the true action-value function. A backup over an experience e_k consists of policy evaluation with the Bellman operator and greedy policy improvement. As the policy improvement is greedy, the value metrics of experience can be rewritten in simpler forms. From (1), EVB can be written as

EVB(e_k) = max_a Q_new(s_k, a) − max_a Q_old(s_k, a). (3)

Note that EVB here is different from that in Mattar & Daw (2018): in our case, EVB is derived for Q-learning, while in their case, EVB is derived for Dyna, a model-based RL algorithm (Sutton, 1990). Similarly, from (2), PIV can be written as

PIV(e_k) = max_a Q_new(s_k, a) − Q_new(s_k, a_old), (4)

where a_old = arg max_a Q_old(s_k, a), and EIV can be written as

EIV(e_k) = Q_new(s_k, a_old) − Q_old(s_k, a_old). (5)
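Under greedy policy improvement, the three metrics in (3)-(5) can be computed directly from the Q-tables before and after an update. A small Python sketch (the table contents are illustrative only):

```python
import numpy as np

def value_metrics(Q_old, Q_new, s):
    """EVB, PIV and EIV of an update, evaluated at state s (equations 3-5).

    a_old is the greedy action before the update; note that
    EVB = PIV + EIV holds by construction.
    """
    a_old = int(np.argmax(Q_old[s]))
    evb = Q_new[s].max() - Q_old[s].max()    # (3)
    piv = Q_new[s].max() - Q_new[s, a_old]   # (4)
    eiv = Q_new[s, a_old] - Q_old[s, a_old]  # (5)
    return evb, piv, eiv

Q_old = np.array([[1.0, 0.0]])
Q_new = np.array([[1.0, 2.0]])  # the update raised Q(s=0, a=1) to 2
evb, piv, eiv = value_metrics(Q_old, Q_new, 0)  # 1.0, 1.0, 0.0
```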

2.4. A MOTIVATING EXAMPLE

We illustrate the potential gain of the value of experience in a "Linear GridWorld" environment (Figure 1a). This environment contains N linearly-aligned grids and 4 actions (north, south, east, west). Rewards are sparse: 1 for entering the goal state and 0 elsewhere. The solution for this environment is always choosing east. We use this example to highlight the difference between prioritization strategies. Three agents perform Q-learning updates on experiences drawn from the same replay buffer, which contains all 4N experiences and the associated rewards. The first agent replays experiences uniformly at random, while the other two agents invoke an oracle to prioritize experiences, greedily selecting the experience with the highest surprise or the highest EVB, respectively. In order to learn the optimal policy, agents need to replay the experiences associated with action east in reverse order. For the agent with uniform replay, the expected number of replays required is 4N² (Figure 1d). For the other two agents, prioritization significantly reduces the number of replays required: prioritization by surprise requires 4N replays, and prioritization by EVB uses only N replays, which is optimal (Figure 1d). The main difference is that EVB prioritizes only the experiences associated with the optimal policy (Figure 1c), while surprise is sensitive to changes in the value function and will prioritize non-optimal experiences: for example, the surprise agent may choose experiences associated with south or north in the second update, which are not optimal but have the same surprise as the experience associated with east (Figure 1b). Thus, EVB, which directly quantifies the value of experience, can serve as an optimal priority.
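The EVB agent's behavior can be reproduced with a short simulation. The sketch below is a simplified stand-in for the environment: non-east actions are assumed to leave the agent in place, and the oracle greedily replays the highest-EVB experience with full step-size α = 1. It recovers the N-replay behavior, with east experiences replayed in reverse order from the goal:

```python
import numpy as np

N, GAMMA, EAST = 5, 0.9, 0  # states 0..N-1, state N is the goal

# Replay buffer with all 4N experiences (s, a, r, s'): east moves right,
# the other three actions are modelled as leaving the state unchanged.
buffer = [(s, a,
           1.0 if (a == EAST and s == N - 1) else 0.0,
           s + 1 if a == EAST else s)
          for s in range(N) for a in range(4)]

def evb(Q, exp):
    """EVB of a full (alpha = 1) Q-learning backup on exp, per equation (3)."""
    s, a, r, s2 = exp
    q_row = Q[s].copy()
    q_row[a] = r + GAMMA * Q[s2].max()
    return q_row.max() - Q[s].max()

def replays_with_evb_oracle():
    Q = np.zeros((N + 1, 4))
    count = 0
    # Replay until east is strictly the best action in every state.
    while any(Q[s, EAST] <= 0 for s in range(N)):
        s, a, r, s2 = max(buffer, key=lambda e: evb(Q, e))
        Q[s, a] = r + GAMMA * Q[s2].max()
        count += 1
    return count

print(replays_with_evb_oracle())  # N replays: one east backup per state
```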

3. UPPER BOUNDS OF VALUE METRICS OF EXPERIENCE IN Q-LEARNING

PER (Schaul et al., 2016) greatly improves the learning efficiency of DQN. However, the underlying rationale is not well understood. Here, we prove that surprise is an upper bound of the value metrics in Q-learning. Proof. See Appendix A.1. In Theorem 3.1, we prove that |EVB|, |PIV|, and |EIV| are upper-bounded by the surprise (scaled by the learning step-size) in Q-learning. As surprise intrinsically tracks the evaluation and policy improvements, it can serve as an appropriate importance metric for past experiences. We further examine this relationship in experiments.
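The claim of Theorem 3.1 is easy to sanity-check numerically: for random Q-tables and random experiences, none of the three metrics exceeds α|TD|. A hypothetical Monte-Carlo check (table sizes and hyperparameters are placeholders):

```python
import numpy as np

def theorem31_violations(trials=1000, alpha=0.5, gamma=0.9, seed=0):
    """Count cases where |EVB|, |PIV| or |EIV| exceeds alpha * |TD|."""
    rng = np.random.default_rng(seed)
    bad = 0
    for _ in range(trials):
        Q_old = rng.normal(size=(4, 3))
        s, a, s2 = rng.integers(4), rng.integers(3), rng.integers(4)
        r = rng.normal()
        td = r + gamma * Q_old[s2].max() - Q_old[s, a]
        Q_new = Q_old.copy()
        Q_new[s, a] += alpha * td                 # one Q-learning backup
        a_old = int(np.argmax(Q_old[s]))
        evb = Q_new[s].max() - Q_old[s].max()
        piv = Q_new[s].max() - Q_new[s, a_old]
        eiv = Q_new[s, a_old] - Q_old[s, a_old]
        if max(abs(evb), abs(piv), abs(eiv)) > alpha * abs(td) + 1e-9:
            bad += 1
    return bad
```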

4. EXTENSION TO MAXIMUM-ENTROPY RL

In this section, we extend our framework to study the relationship between surprise and value of experience in maximum-entropy RL, particularly, soft Q-learning.

4.1. SOFT Q-LEARNING

Unlike regular RL algorithms, maximum-entropy RL augments the reward with an entropy term, R = r + βH(π(·|s)), where H(·) is the entropy and β is a temperature parameter that determines the relative importance of the entropy term and the reward. The goal is to maximize the expected cumulative entropy-augmented reward. Maximum-entropy RL algorithms have advantages at capturing multiple modes of near-optimal policies, better exploration, and better transfer between tasks. Soft Q-learning is an off-policy value-based algorithm built on maximum-entropy RL principles (Haarnoja et al., 2017; Schulman et al., 2017). Different from Q-learning, the target policy of soft Q-learning is stochastic. During policy iteration, the Q-function is updated through the soft Bellman operator Γ^soft, and the policy is updated to a maximum-entropy policy:

Policy evaluation: Q^soft_new(s, a) = [Γ^soft Q^soft_old](s, a) = r + γ V^soft_old(s'),
Policy improvement: π_new(a|s) = softmax_a(Q^soft_new(s, a) / β),

where softmax_i(x) = exp(x_i) / Σ_j exp(x_j) is the softmax function, and the soft value function V^soft_π(s) is defined as

V^soft_π(s) = E_{a∼π}{Q^soft_π(s, a) − β log π(a|s)} = β log Σ_a exp(Q^soft_π(s, a) / β).

Similarly to Q-learning, the TD error in soft Q-learning (the soft TD error) is given by TD^soft(s, a, r, s') = r + γ V^soft_old(s') − Q^soft_old(s, a).
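In code, the soft value function, the maximum-entropy policy, and the soft TD error are a few lines each. A numerically-stabilized sketch (β and γ values are placeholders):

```python
import numpy as np

def soft_value(q, beta=0.5):
    """V^soft(s) = beta * log sum_a exp(Q^soft(s, a) / beta).
    The max is subtracted first for numerical stability."""
    m = q.max()
    return m + beta * np.log(np.exp((q - m) / beta).sum())

def soft_policy(q, beta=0.5):
    """Maximum-entropy policy pi(a|s) = softmax_a(Q^soft(s, a) / beta)."""
    z = np.exp((q - q.max()) / beta)
    return z / z.sum()

def soft_td(Q, experience, beta=0.5, gamma=0.9):
    """Soft TD error: r + gamma * V^soft_old(s') - Q^soft_old(s, a)."""
    s, a, r, s2 = experience
    return r + gamma * soft_value(Q[s2], beta) - Q[s, a]

# For uniform Q-values the policy is uniform and the soft value
# reduces to beta * log(number of actions):
q = np.zeros(4)
pi = soft_policy(q)   # [0.25, 0.25, 0.25, 0.25]
v = soft_value(q)     # 0.5 * log(4)
```

The identity V^soft = E[Q^soft − β log π] can be checked directly against the log-sum-exp form.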

4.2. VALUE METRICS OF EXPERIENCE IN MAXIMUM-ENTROPY RL

Here, we extend the value metrics of experience to soft Q-learning. Analogous to (1), EVB for maximum-entropy RL is defined as

EVB^soft(e_k) = v^soft_new(s_k) − v^soft_old(s_k) = Σ_a π_new(a|s_k){q^soft_new(s_k, a) − β log π_new(a|s_k)} − Σ_a π_old(a|s_k){q^soft_old(s_k, a) − β log π_old(a|s_k)}. (6)

EVB^soft can be separated into PIV^soft and EIV^soft, which respectively quantify the value of the policy improvement and of the evaluation improvement in soft Q-learning:

PIV^soft(e_k) = Σ_a π_new(a|s_k){q^soft_new(s_k, a) − β log π_new(a|s_k)} − Σ_a π_old(a|s_k){q^soft_new(s_k, a) − β log π_old(a|s_k)} = Σ_a {π_new(a|s_k) − π_old(a|s_k)} q^soft_new(s_k, a) + β(H(π_new(·|s_k)) − H(π_old(·|s_k))), (7)

EIV^soft(e_k) = Σ_a π_old(a|s_k)[q^soft_new(s_k, a) − q^soft_old(s_k, a)]. (8)

The value metrics of experience in maximum-entropy RL have similar forms to those in regular RL except for the entropy terms, because a change in the policy also changes the policy entropy and thereby the entropy-augmented reward.
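The soft metrics (6)-(8) can be computed from the soft Q-values at s_k before and after a backup. A minimal sketch for one state row (β = 0.5 and the Q-values are placeholders) that also exercises the decomposition EVB^soft = PIV^soft + EIV^soft:

```python
import numpy as np

def soft_metrics(q_old, q_new, beta=0.5):
    """EVB^soft, PIV^soft, EIV^soft for one state (equations 6-8).
    q_old, q_new: action values at s_k before/after the backup."""
    def pi(q):
        z = np.exp((q - q.max()) / beta)
        return z / z.sum()
    def v_soft(q):  # beta * logsumexp(q / beta), stabilized
        return q.max() + beta * np.log(np.exp((q - q.max()) / beta).sum())
    p_old, p_new = pi(q_old), pi(q_new)
    H = lambda p: -(p * np.log(p)).sum()
    evb = v_soft(q_new) - v_soft(q_old)                                   # (6)
    piv = ((p_new - p_old) * q_new).sum() + beta * (H(p_new) - H(p_old))  # (7)
    eiv = (p_old * (q_new - q_old)).sum()                                 # (8)
    return evb, piv, eiv

evb, piv, eiv = soft_metrics(np.array([1.0, 0.0]), np.array([1.0, 0.8]))
# evb equals piv + eiv up to floating-point error, and piv is nonnegative.
```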

4.3. LOWER AND UPPER BOUNDS OF VALUE METRICS OF EXPERIENCE IN SOFT Q-LEARNING

We theoretically derive the lower and upper bounds of the value metrics of experience in soft Q-learning (Theorems 4.1 and 4.2). The bounds are products of a policy-related term and the surprise (the magnitude of the soft TD error). The policy-related term ρ_π quantifies the "on-policyness" of the experienced action, and the bounds become tighter as the difference between π_old(a_k|s_k) and π_new(a_k|s_k) becomes smaller. Surprisingly, the temperature β affects the bounds only through the policy term, which makes the bound an excellent priority even if β changes during learning (Haarnoja et al., 2018). As 0 ≤ ρ^max_π ≤ 1, the value metrics are also upper-bounded by the surprise |TD^soft| alone, similar to Q-learning. However, as π(a_k|s_k) is usually less than 1, surprise is a looser upper bound in soft Q-learning. This is consistent with a previous study showing empirically that directly applying PER with surprise alone in soft Q-learning does not significantly improve the sample efficiency (Wang & Ross, 2019). We further examine these relationships in experiments.
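These bounds can also be verified numerically: apply a single soft backup Q_new(s_k, a_k) = Q_old(s_k, a_k) + TD^soft and compare |EVB^soft| against ρ^min_π |TD^soft| and ρ^max_π |TD^soft|. A hypothetical Monte-Carlo check (random 5-action tables, β = 0.5):

```python
import numpy as np

def evb_soft_bounds_hold(trials=1000, beta=0.5, seed=1):
    """Check rho_min * |TD| <= |EVB^soft| <= rho_max * |TD| on random tables."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        q_old = rng.normal(size=5)
        a_k = int(rng.integers(5))
        td = float(rng.normal())      # plays the role of TD^soft
        q_new = q_old.copy()
        q_new[a_k] += td              # full soft Bellman backup at (s_k, a_k)
        v = lambda q: q.max() + beta * np.log(np.exp((q - q.max()) / beta).sum())
        pi = lambda q: np.exp((q - q.max()) / beta) / np.exp((q - q.max()) / beta).sum()
        evb = v(q_new) - v(q_old)
        rho_old, rho_new = pi(q_old)[a_k], pi(q_new)[a_k]
        lo = min(rho_old, rho_new) * abs(td)
        hi = max(rho_old, rho_new) * abs(td)
        if not (lo - 1e-9 <= abs(evb) <= hi + 1e-9):
            return False
    return True
```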

5. EXPERIMENTS

Our experiments aim to answer the following questions: (i) Do the theoretical bounds of the value metrics of experience hold in practice? (ii) If the bounds hold, are they tight? (iii) Do the bounds derived for maximum-entropy RL help to improve performance? First, we implement tabular versions of Q-learning and soft Q-learning in Maze to verify the bounds for tabular methods. Second, by slightly modifying the definition of the value metrics of experience, we extend our framework to function approximation methods, which are far more powerful than tabular methods (see Appendix A.4). We implement DQN and soft DQN in CartPole to examine the bounds for function approximation methods. Finally, we implement PER with the theoretical upper bound as priority for soft DQN and evaluate its effectiveness. Throughout the experiments, the value metrics of experience (EVB, PIV and EIV) for Q-learning are calculated using (3), (4), and (5), and those for soft Q-learning are calculated using (6), (7), and (8). For soft Q-learning, we do not model the actor explicitly: the policy is calculated as the softmax of the soft Q-function (see Section 4.1). The upper bound for the value metrics in Q-learning is the surprise (Theorem 3.1), while the lower and upper bounds for soft Q-learning are calculated according to Theorems 4.1 and 4.2, which include a policy term and the surprise. The experimental details are described in Appendix A.5 and all code is available at: https://github.com/RLforlife/VER.

5.1. MAZE

The first set of experiments is conducted in a maze environment, a 5 × 5 square with walls. The agent needs to reach the goal zone by moving one square in any of the four directions (north, south, east, west) at each step. We implement tabular versions of Q-learning and soft Q-learning to solve this problem. For each algorithm, the value metrics of experience as well as the theoretical bounds are illustrated in Figure 2. As shown in the upper panels, all three value metrics of experience are bounded by the surprise for Q-learning. As our theory predicts, the absolute values of EIV are either equal to the surprise (if the action of the experience is the best action before the update) or 0. For soft Q-learning, the three value metrics of experience are bounded by the theoretical upper bound, and EVB and EIV are bounded by the theoretical lower bound, supporting our theory (Theorems 4.1 and 4.2). A large proportion of EVBs lie on the identity line, indicating that the bounds are tight. The proportion of non-zero values of experience is higher in soft Q-learning than in Q-learning, because all value metrics of experience are affected by the "on-policyness" of the experienced actions. Q-learning learns a deterministic policy that makes most actions of experiences off-policy, while soft Q-learning learns a stochastic policy that results in less sparse values of experience. In summary, the experimental results in the maze environment support the theoretical bounds of the value metrics of experience in Q-learning and soft Q-learning.

5.2. CARTPOLE

CartPole is a pendulum with its center of gravity above its pivot point. The goal is to keep the pole balanced by moving the cart forward and backward. We implement DQN and soft DQN (DQN with the soft Bellman update) in this environment. For DQN, we replace the Q-network in Mnih et al. (2015) with a two-layer MLP. For soft DQN, all settings are the same as for DQN, except for two modifications: during policy evaluation, the (soft) Q-network is updated according to the soft TD error; during policy improvement, the policy is updated to the maximum-entropy policy, the softmax of the soft Q-values (see Section 4.1).

5.3. ATARI GAMES

In this set of experiments, we investigate whether the theoretical upper bound of the value metrics of experience, which balances the surprise and the "on-policyness" of the experience (Figure 6), can serve as an appropriate priority for experience replay in soft Q-learning. More specifically, we compare the performance of soft DQN with different prioritization strategies: uniform replay, and prioritization by surprise or by the theoretical upper bound (ρ^max_π |TD^soft|), denoted soft DQN, PER and VER (valuable experience replay), respectively. This set of experiments consists of nine Atari 2600 games, in which the goal is to learn to play each game with the screen pixels as the only input. We closely follow the experimental setting and network architecture outlined by Mnih et al. (2015). For each game, the network is trained on a single GPU for 40M frames, or approximately 2 days. Figure 4 shows that soft DQN prioritized by surprise or by the theoretical upper bound significantly outperforms uniform replay in most of the games. On average, soft DQN with PER or VER outperforms vanilla soft DQN by 11.8% and 18.0%, respectively. Moreover, VER converges faster and outperforms PER in most of the games (by 8.47% on average), which suggests that a tighter upper bound on the value metrics improves the performance of experience replay. These results suggest that the theoretical upper bound can serve as an appropriate priority for experience replay in soft Q-learning.
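For concreteness, here is a sketch of how a VER priority could be computed for a batch of experiences in tabular soft Q-learning (γ, β and the tables are illustrative; a deep-RL version would use the Q-network outputs instead):

```python
import numpy as np

GAMMA, BETA = 0.9, 0.5

def softmax_rows(q):
    """Row-wise maximum-entropy policy pi(a|s) = softmax(Q(s, .) / beta)."""
    z = np.exp((q - q.max(axis=1, keepdims=True)) / BETA)
    return z / z.sum(axis=1, keepdims=True)

def soft_values(q):
    """Row-wise soft value V^soft(s) = beta * logsumexp(Q(s, .) / beta)."""
    m = q.max(axis=1)
    return m + BETA * np.log(np.exp((q - m[:, None]) / BETA).sum(axis=1))

def ver_priorities(q_old, q_new, experiences):
    """VER priority rho_max * |TD^soft| per experience (Theorem 4.1),
    where rho_max = max{pi_old(a_k|s_k), pi_new(a_k|s_k)}."""
    p_old, p_new = softmax_rows(q_old), softmax_rows(q_new)
    v_old = soft_values(q_old)
    prios = []
    for s, a, r, s2 in experiences:
        td = r + GAMMA * v_old[s2] - q_old[s, a]
        rho_max = max(p_old[s, a], p_new[s, a])
        prios.append(rho_max * abs(td))
    return np.array(prios)

# Sampling then proceeds as in PER, with probability proportional to priority:
# probs = prios / prios.sum()
p = ver_priorities(np.zeros((2, 4)), np.zeros((2, 4)),
                   [(0, 0, 1.0, 1), (0, 1, 0.0, 1)])
```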

6. DISCUSSION

In this work, we formulate a framework to study the relationship between the importance of experience and surprise (the magnitude of the TD error). To quantify the importance of experience, we derive three value metrics of experience: the expected value of backup, the evaluation improvement value, and the policy improvement value. For Q-learning, we theoretically show that these value metrics are upper bounded by surprise. Our claims are supported by experiments with tabular Q-learning and DQN. Thus, surprise implicitly tracks the value of experience, which explains the high sample efficiency of PER. Furthermore, we extend our framework to maximum-entropy RL, showing that these value metrics are lower and upper bounded by the product of a policy term and surprise. The results in soft Q-learning and soft DQN support our theory. Moreover, we employ the upper bound as the priority for experience replay in soft DQN, termed VER, and empirically show that VER outperforms PER and significantly improves the sample efficiency of soft DQN. By linking surprise and the value of experience, two important quantities in learning, our study has the following implications. First, from a machine learning perspective, our study provides a framework to derive appropriate priorities of experience for different algorithms, with possible extensions to batch RL (Fu et al., 2020) and sequence experience replay (Brittain et al., 2019). Second, for neuroscience, our work provides insight into how the brain might encode the importance of experience. Since surprise biologically corresponds to the reward prediction-error signal in the dopaminergic system (Schultz et al., 1997; Glimcher, 2011) and implicitly tracks the value of experience, the brain may rely on it to differentiate important experiences.

A APPENDIX

A.1 PROOF OF THEOREM 3.1

In this section, we derive the upper bounds of the value metrics of experience in Q-learning. The absolute EVB can be written as

|EVB(e_k)| = |max_a Q_new(s_k, a) − max_a Q_old(s_k, a)| ≤ max_a |Q_new(s_k, a) − Q_old(s_k, a)| ≤ α|TD(s_k, a_k, r_k, s'_k)|, (9)

where the first inequality follows from the contraction property of the max operator. The absolute PIV can be written as

|PIV(e_k)| = |max_a Q_new(s_k, a) − Q_new(s_k, arg max_a Q_old(s_k, a))| = max_a Q_new(s_k, a) − Q_new(s_k, a_old) = max_a Q_new(s_k, a) − max_a Q_old(s_k, a) − 1_{a_old = a_k} α TD(s_k, a_k, r_k, s'_k), (10)

where the second equality holds because the change of the Q-function following greedy policy improvement is greater than or equal to 0, and the third follows from the update rule of the Q-function. For TD(s_k, a_k, r_k, s'_k) ≥ 0, we have 0 ≤ max_a Q_new(s_k, a) − max_a Q_old(s_k, a) ≤ α TD(s_k, a_k, r_k, s'_k); for TD(s_k, a_k, r_k, s'_k) ≤ 0, we have max_a Q_new(s_k, a) − max_a Q_old(s_k, a) ≤ 0. Substituting these inequalities into (10), we obtain

|PIV(e_k)| ≤ α|TD(s_k, a_k, r_k, s'_k)|. (11)

Similarly, the absolute EIV can be written as

|EIV(e_k)| = |Q_new(s_k, a_old) − Q_old(s_k, a_old)| = 1_{a_old = a_k} α|TD(s_k, a_k, r_k, s'_k)| ≤ α|TD(s_k, a_k, r_k, s'_k)|. (12)

For (9) and (11), equality is reached if the experienced action is the best action both before and after the update. For (12), equality is reached if the experienced action is the best action before the update.

A.2 PROOF OF THEOREM 4.1

In this section, we derive the upper bounds of the value metrics of experience in soft Q-learning. For soft Q-learning, |EVB^soft| can be written as

|EVB^soft(e_k)| = |β log Σ_a exp(Q^soft_new(s_k, a)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β)| = |β log Σ_a exp((Q^soft_old(s_k, a) + 1_{a=a_k} TD^soft)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β)|.

Let us define the LogSumExp function F(x) = β log Σ_i exp(x_i/β). The LogSumExp function F(x) is convex, and is strictly and monotonically increasing everywhere in its domain (El Ghaoui, 2018). The partial derivative of F(x) is a softmax function, ∂F(x)/∂x_i = softmax_i(x/β) ≥ 0, which takes the same form as the policy of soft Q-learning. By convexity, for ε < 0 we have

ε ∂F(x_1, ..., x_i, ...)/∂x_i ≤ F(x_1, ..., x_i + ε, ...) − F(x_1, ..., x_i, ...) ≤ 0,

and similarly, for ε ≥ 0,

0 ≤ F(x_1, ..., x_i + ε, ...) − F(x_1, ..., x_i, ...) ≤ ε ∂F(x_1, ..., x_i + ε, ...)/∂x_i.

Substituting x_i by Q^soft_old(s_k, a_k) and ε by TD^soft, and rewriting the partial derivative of F(x) into policy form, we obtain the following inequalities. For TD^soft ≤ 0,

π_old(a_k|s_k) TD^soft ≤ β log Σ_a exp(Q^soft_new(s_k, a)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β) ≤ 0,

and similarly, for TD^soft > 0,

0 ≤ β log Σ_a exp(Q^soft_new(s_k, a)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β) ≤ π_new(a_k|s_k) TD^soft.

For |PIV^soft|, since the policy improvement value is always greater than or equal to 0 and EIV^soft(e_k) = π_old(a_k|s_k) TD^soft, we have

|PIV^soft(e_k)| = β log Σ_a exp(Q^soft_new(s_k, a)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β) − π_old(a_k|s_k) TD^soft.

For TD^soft > 0, we have

0 ≤ β log Σ_a exp(Q^soft_new(s_k, a)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β) − π_old(a_k|s_k) TD^soft ≤ π_new(a_k|s_k) TD^soft,

and for TD^soft ≤ 0, we have

0 ≤ β log Σ_a exp(Q^soft_new(s_k, a)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β) − π_old(a_k|s_k) TD^soft ≤ −π_old(a_k|s_k) TD^soft = π_old(a_k|s_k)|TD^soft|.

There is no lower bound of a similar form for |PIV^soft|.

A.3 PROOF OF THEOREM 4.2

In this section, we derive the lower bounds of the value metrics of experience in soft Q-learning. As when deriving the upper bounds in Appendix A.2, we derive the lower bounds of |EVB^soft| using the LogSumExp function F(x) = β log Σ_i exp(x_i/β). For ε < 0, we have

F(x_1, ..., x_i + ε, ...) − F(x_1, ..., x_i, ...) ≤ ε ∂F(x_1, ..., x_i + ε, ...)/∂x_i ≤ 0,

and similarly, for ε ≥ 0,

F(x_1, ..., x_i + ε, ...) − F(x_1, ..., x_i, ...) ≥ ε ∂F(x_1, ..., x_i, ...)/∂x_i ≥ 0.

Substituting x_i by Q^soft_old(s_k, a_k) and ε by TD^soft, and rewriting the partial derivative of F(x) into policy form, we obtain the following inequalities. For TD^soft ≤ 0,

β log Σ_a exp(Q^soft_new(s_k, a)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β) ≤ π_new(a_k|s_k) TD^soft ≤ 0,

and similarly, for TD^soft > 0,

β log Σ_a exp(Q^soft_new(s_k, a)/β) − β log Σ_a exp(Q^soft_old(s_k, a)/β) ≥ π_old(a_k|s_k) TD^soft ≥ 0.

Thus, we have the lower bound of |EVB^soft|:

|EVB^soft(e_k)| ≥ min{π_old(a_k|s_k), π_new(a_k|s_k)} |TD^soft|.

For |EIV^soft|, we have

|EIV^soft(e_k)| = |Σ_a π_old(a|s_k)[Q^soft_new(s_k, a) − Q^soft_old(s_k, a)]| = π_old(a_k|s_k)|TD^soft| ≥ min{π_old(a_k|s_k), π_new(a_k|s_k)} |TD^soft|.

A.4 EXTENSION TO FUNCTION APPROXIMATION METHODS

In function approximation methods, we learn a parameterized Q-function Q(s, a; θ_t). The parameter is updated on experience e_k through a gradient-based method,

θ_{t+1} = θ_t + α (Q_target(s_k, a_k) − Q(s_k, a_k; θ_t)) ∇_{θ_t} Q(s_k, a_k; θ_t),

where α is the learning rate and Q_target is the target Q-value, defined as Q_target(s_k, a_k) = r_k + γ max_{a'} Q(s'_k, a'; θ_t). The TD error is defined as TD = Q_target(s_k, a_k) − Q(s_k, a_k; θ_t). As α in function approximation Q-learning is usually very small, each update moves the parameterized function towards its target value only by a small amount. Our framework can be extended to function approximation methods by slightly modifying the definition of the value metrics of experience: replacing the Q-function after the update, Q(s, a; θ_{t+1}), by the target Q-value, Q_target(s, a), in the value metrics of experience ((1)-(5) and (6)-(8)). The intuition behind this modification is simple: the value is defined by the cause of the update (the target Q-value), not by the result of the gradient-based update. With this modification, our theory is applicable to all function approximation methods, regardless of the specific form of the function approximator (linear function or neural network). For Q-learning, the value metrics can be written as

EVB(e_k) = max_a Q_target(s_k, a) − max_a Q(s_k, a; θ_t),
PIV(e_k) = max_a Q_target(s_k, a) − Q_target(s_k, a_old),
EIV(e_k) = Q_target(s_k, a_old) − Q(s_k, a_old; θ_t).

And for soft Q-learning, the value metrics can be written as

EVB^soft(e_k) = β log Σ_a exp(Q^soft_target(s_k, a)/β) − β log Σ_a exp(Q^soft(s_k, a; θ_t)/β),
PIV^soft(e_k) = β log Σ_a exp(Q^soft_target(s_k, a)/β) − β log Σ_a exp(Q^soft(s_k, a; θ_t)/β) − π_old(a_k|s_k) TD^soft,
EIV^soft(e_k) = Σ_a π_old(a|s_k)[Q^soft_target(s_k, a) − Q^soft(s_k, a; θ_t)].

After these modifications, the value metrics of experience have similar forms to the tabular case, and all theorems derived in the tabular case apply to function approximation methods.

A.5.2 CARTPOLE

For CartPole, the goal is to keep the pole balanced by moving the cart forward and backward for 200 steps. We test our theoretical predictions on DQN and soft DQN (DQN with the soft Bellman update). For DQN, we implement the model according to Mnih et al. (2015), where we replace the original Q-network with a two-layer MLP with 256 ReLU neurons per layer. The ε of the ε-greedy policy decays exponentially from 1 to 0.01 over the first 10,000 steps, and remains 0.01 afterwards. For soft DQN, all settings are the same as for DQN, except for two modifications: for policy evaluation, the (soft) Q-network is updated according to the soft TD error; the policy follows the maximum-entropy policy, calculated as the softmax of the soft Q-values (see Section 4.1). The temperature parameter β is set to 0.5. For both algorithms, the discount factor is 0.99, the learning rate is 0.005, the experience buffer size is 1000, the batch size is 16, and the total environment interaction is 50,000 steps.

A.5.3 ATARI GAMES

For this set of experiments, we compare the performance of vanilla soft DQN and soft DQN with prioritized experience replay, where we use the surprise or the theoretical upper bound as priority (Schaul et al., 2016), respectively denoted as PER and VER (valuable experience replay). We select 9 Atari games for the experiments: Alien, BattleZone, Boxing, BeamRider, DemonAttack, MsPacman, Qbert, Seaquest and SpaceInvaders. The vanilla soft DQN is similar to that described in the section above, but with the Q-network the same as in Mnih et al. (2015). We implement PER on soft DQN according to Schaul et al. (2016). For all algorithms, the temperature parameter β is 0.05, the discount factor is 0.99, the learning rate is 1e-4, the experience buffer size is 1M, and the batch size is 32. For PER and VER, the importance-sampling parameters are α_IS = 0.4 and β_IS = 0.6. For each game, the network is trained on a single GPU for 40M frames, or approximately 2 days.



Figure 1: a. Illustration of the "Linear GridWorld" example: there are N grids and 4 actions (north, south, east, west). Reward for entering the goal state (cheese) is 1; reward is 0 elsewhere. b-c. Examples of prioritized experience replay by surprise and by value of experience (EVB). The main difference is that EVB only prioritizes the experiences that are associated with the optimal policy, while surprise is sensitive to changes in the value function and will prioritize non-optimal experiences, such as those associated with north or south. Here squares represent states, triangles represent actions, and experiences associated with the highest priority are highlighted. d. Expected number of replays needed to learn the optimal policy, as the number of grids changes: uniform replay (blue), prioritized by surprise (orange), and by EVB (green).

Theorem 3.1. The three value metrics of experience e_k in Q-learning (|EVB|, |PIV| and |EIV|) are upper bounded by α|TD(s_k, a_k, r_k, s'_k)|, where α is the step-size parameter.

Figure 2: Results of Q-learning and soft Q-learning in Maze. a-c. Surprise (the magnitude of the TD error) vs. the absolute value of EVB (left), EIV (middle) and PIV (right) in Q-learning. d-f. Theoretical upper bound and g-i. lower bound vs. the absolute value of EVB, EIV and PIV in soft Q-learning. The red line is the identity line.

Figure 3: Results of DQN and soft DQN in CartPole. a-c. Surprise (the magnitude of the TD error) vs. the absolute value of EVB (left), EIV (middle) and PIV (right) in DQN. d-f. Theoretical upper bound and g-i. lower bound vs. the absolute value of EVB, EIV and PIV in soft DQN. The red line is the identity line.

Figure 4: Learning curves of soft DQN (blue lines), and soft DQN with prioritized experience replay using the soft TD error (PER, orange lines) or the theoretical upper bound of the value metrics of experience (VER, green lines) as priority, on nine Atari games.

Thus, we have the upper bound of |EVB^soft|:

|EVB^soft(e_k)| ≤ max{π_old(a_k|s_k), π_new(a_k|s_k)} |TD^soft|.

For |PIV^soft|, we have

|PIV^soft(e_k)| = |Σ_a π_new(a|s_k){Q^soft_new(s_k, a) − β log π_new(a|s_k)} − Σ_a π_old(a|s_k){Q^soft_new(s_k, a) − β log π_old(a|s_k)}| = Σ_a π_new(a|s_k){Q^soft_new(s_k, a) − β log π_new(a|s_k)} − Σ_a π_old(a|s_k){Q^soft_old(s_k, a) − β log π_old(a|s_k)} − π_old(a_k|s_k) TD^soft.

Thus, we have the upper bound of |PIV^soft|:

PIV^soft(e_k) ≤ max{π_old(a_k|s_k), π_new(a_k|s_k)} |TD^soft|.

Also, for |EIV^soft|, we have

|EIV^soft(e_k)| = |Σ_a π_old(a|s_k)[Q^soft_new(s_k, a) − Q^soft_old(s_k, a)]| = π_old(a_k|s_k) |TD^soft| ≤ max{π_old(a_k|s_k), π_new(a_k|s_k)} |TD^soft|.

Figure 5: Maze environment and learning curves.

Figure 6: Illustration of the difference between VER and PER in soft Q-learning. VER uses the theoretical upper bound as priority, ρ^max_π |TD^soft|, which balances the TD error and the "on-policyness" of the experience. Depicted are the theoretical upper bound (left), |TD^soft| (middle), and the policy term (right) of 50 experiences from the replay buffer in the maze (upper panel) and CartPole (lower panel), ordered by the theoretical upper bound.

Theorem 4.1. The three value metrics of experience e_k in soft Q-learning (|EVB^soft|, |PIV^soft| and |EIV^soft|) are upper bounded by ρ^max_π |TD^soft|, where ρ^max_π = max{π_old(a_k|s_k), π_new(a_k|s_k)} is a policy-related term.

Theorem 4.2. In soft Q-learning, |EVB^soft| and |EIV^soft| (but not |PIV^soft|) are lower bounded by ρ^min_π |TD^soft|, where ρ^min_π = min{π_old(a_k|s_k), π_new(a_k|s_k)} is a policy-related term.

