REVISITING PRIORITIZED EXPERIENCE REPLAY: A VALUE PERSPECTIVE

Abstract

Reinforcement learning (RL) agents need to learn from past experiences. Prioritized experience replay, which weighs experiences by their surprise (the magnitude of the temporal-difference error), significantly improves the learning efficiency of RL algorithms. Intuitively, surprise quantifies the unexpectedness of an experience to the learning agent, but how surprise relates to the importance of experience is not well understood. To address this problem, we derive three value metrics that quantify the importance of experience in terms of the extra reward that would be earned by accessing the experience. We theoretically show that these value metrics are upper-bounded by surprise for Q-learning. Furthermore, we extend our theoretical framework to maximum-entropy RL by deriving lower and upper bounds of these value metrics for soft Q-learning, which turn out to be the product of surprise and the "on-policyness" of the experiences. Our framework links two important quantities in RL, i.e., surprise and the value of experience, and provides a theoretical basis for estimating the value of experience by surprise. We empirically show that the bounds hold in practice, and that experience replay using the upper bound as priority improves maximum-entropy RL in Atari games.

1. INTRODUCTION

Learning from important experiences prevails in nature. In the rodent hippocampus, memories with higher importance, such as those associated with rewarding locations or large reward-prediction errors, are replayed more frequently (Michon et al., 2019; Roscow et al., 2019; Salvetti et al., 2014). People who replay high-reward-associated memories more frequently show better performance in memory tasks (Gruber et al., 2016; Schapiro et al., 2018). A normative theory suggests that prioritized memory access according to the utility of memory explains hippocampal replay across different memory tasks (Mattar & Daw, 2018). As accumulating new experiences is costly, utilizing valuable past experiences is key to efficient learning (Ólafsdóttir et al., 2018). Differentiating important experiences from unimportant ones also benefits reinforcement learning (RL) algorithms (Katharopoulos & Fleuret, 2018). Prioritized experience replay (PER) (Schaul et al., 2016) is an experience replay technique built on the deep Q-network (DQN) (Mnih et al., 2015), which weighs the importance of samples by their surprise, the magnitude of the temporal-difference (TD) error. As a result, experiences with larger surprise are sampled more frequently. PER significantly improves the learning efficiency of DQN, and has been adopted (Hessel et al., 2018; Horgan et al., 2018; Kapturowski et al., 2019) and extended (Daley & Amato, 2019; Pan et al., 2018; Schlegel et al., 2019) by various deep RL algorithms. Surprise quantifies the unexpectedness of an experience to a learning agent, and biologically corresponds to the reward prediction error signal in the dopamine system (Schultz et al., 1997; Glimcher, 2011), which directly shapes the memories of animals and humans (Lisman & Grace, 2005; McNamara et al., 2014). However, how surprise is related to the importance of experience in the context of RL is not well understood.
We address this problem from an economic perspective, by linking surprise to the value of experience in RL. The goal of an RL agent is to maximize the expected cumulative reward, which is achieved through learning from experiences. For Q-learning, an update on an experience leads to a more accurate prediction of the action-value or a better policy, which increases the expected cumulative reward the agent may get. We define the value of experience as the increase in the expected cumulative reward resulting from updating on the experience (Mattar & Daw, 2018). The value of experience quantifies the importance of experience from first principles: assuming that the agent is economically rational and has full information about the value of experience, it will choose the most valuable experience to update on, which will yield the highest utility. As supplements, we derive two more value metrics, which correspond to the evaluation improvement value and the policy improvement value due to updating on an experience. In this work, we mathematically show that these value metrics are upper-bounded by surprise for Q-learning. Therefore, surprise implicitly tracks the value of experience, and accounts for the importance of experience. We further extend our framework to maximum-entropy RL, which augments the reward with an entropy term to encourage exploration (Haarnoja et al., 2017). We derive the lower and upper bounds of these value metrics for soft Q-learning, which are related to surprise and the "on-policyness" of the experience. Experiments in Maze and CartPole support our theoretical results for both tabular and function-approximation RL methods, showing that the derived bounds hold in practice. Moreover, we show that experience replay using the upper bound as priority improves maximum-entropy RL (i.e., soft DQN) in Atari games.

2. MOTIVATION

2.1. Q-LEARNING AND EXPERIENCE REPLAY

We consider a Markov Decision Process (MDP) defined by a tuple {S, A, P, R, γ}, where S is a finite set of states, A is a finite set of actions, P is the transition function, R is the reward function, and γ ∈ [0, 1] is the discount factor. A policy π of an agent assigns probability π(a|s) to each action a ∈ A given state s ∈ S. The goal is to learn an optimal policy that maximizes the expected discounted return starting from time step t, G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ = Σ_{i=0}^∞ γ^i r_{t+i}, where r_t is the reward the agent receives at time step t. The value function v_π(s) is defined as the expected return starting from state s and following policy π, and the Q-function q_π(s, a) is the expected return of performing action a in state s and subsequently following policy π. In Q-learning (Watkins & Dayan, 1992), the optimal policy can be learned through policy iteration: performing policy evaluation and policy improvement alternately and iteratively. For each policy evaluation, we update Q(s, a), an estimate of q_π(s, a), by Q_new(s, a) = Q_old(s, a) + α TD(s, a, r, s'), where the TD error is TD(s, a, r, s') = r + γ max_{a'} Q_old(s', a') − Q_old(s, a) and α is the step-size parameter; Q_old and Q_new denote the estimated Q-function before and after the update, respectively. For each policy improvement, we update the policy from π_old to π_new according to the newly estimated Q-function, π_new(s) = argmax_a Q_new(s, a). Standard Q-learning uses each experience only once before it is discarded, which is sample-inefficient and can be improved by the experience replay technique (Lin, 1992). We denote the experience that the agent collected at time step k by a 4-tuple e_k = (s_k, a_k, r_k, s'_k). With experience replay, the experience e_k is stored in a replay buffer and can be accessed multiple times during learning.
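The update rule and replay buffer described above can be sketched in minimal tabular form. This is an illustrative sketch, not the paper's implementation: the hyperparameters (ALPHA, GAMMA, buffer capacity) and all function names are assumptions chosen for the example.

```python
import random
from collections import defaultdict, deque

# Assumed hyperparameters for the sketch: step size alpha and discount gamma.
ALPHA, GAMMA = 0.1, 0.99

Q = defaultdict(float)         # Q[(s, a)] -> estimated action value, zero-initialized
buffer = deque(maxlen=10_000)  # replay buffer storing experiences e_k = (s, a, r, s')

def td_error(s, a, r, s_next, actions):
    """TD(s, a, r, s') = r + gamma * max_a' Q_old(s', a') - Q_old(s, a)."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    return target - Q[(s, a)]

def replay_update(actions, batch_size=32):
    """Sample stored experiences uniformly and apply the Q-learning update."""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for (s, a, r, s_next) in batch:
        Q[(s, a)] += ALPHA * td_error(s, a, r, s_next, actions)
```

Each stored transition can be replayed many times, in contrast to standard Q-learning, which applies the update once as the transition is observed and then discards it.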

2.2. VALUE METRICS OF EXPERIENCE

To quantify the importance of experience, we derive three value metrics of experience. The utility of updating on experience e_k is defined as the value added to the cumulative discounted rewards starting from state s_k after updating on e_k. Intuitively, choosing the most valuable experience for the update will yield the highest utility to the agent. We denote this utility as the expected value of backup EVB(e_k) (Mattar & Daw, 2018): EVB(e_k) = v_{π_new}(s_k) − v_{π_old}(s_k) = Σ_a π_new(a|s_k) q_{π_new}(s_k, a) − Σ_a π_old(a|s_k) q_{π_old}(s_k, a), where π_old, v_{π_old} and q_{π_old} are respectively the policy, value function and Q-function before the update, and π_new, v_{π_new} and q_{π_new} are those after. As the update on experience e_k consists of policy
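As a concrete illustration of EVB, the sketch below evaluates v_{π_new}(s_k) − v_{π_old}(s_k) at a single state for greedy policies, treating the tabular Q estimates before and after an update as stand-ins for the true q-functions (an approximation: the exact EVB uses q_{π_old} and q_{π_new} themselves). All names here are hypothetical.

```python
import numpy as np

def greedy_policy(q_row):
    """Greedy policy pi(.|s) over actions, as a one-hot probability vector."""
    pi = np.zeros_like(q_row)
    pi[np.argmax(q_row)] = 1.0
    return pi

def evb(q_old_row, q_new_row):
    """EVB at state s_k: sum_a pi_new(a|s_k) q_new(s_k, a) - sum_a pi_old(a|s_k) q_old(s_k, a),
    using the Q estimates before (q_old_row) and after (q_new_row) the update."""
    v_old = greedy_policy(q_old_row) @ q_old_row
    v_new = greedy_policy(q_new_row) @ q_new_row
    return v_new - v_old
```

For example, if an update raises the estimate of the second action from 0.5 to 1.4 while the first stays at 1.0, the greedy policy switches actions and the backup is worth 1.4 − 1.0 = 0.4 at that state.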

