REVISITING PRIORITIZED EXPERIENCE REPLAY: A VALUE PERSPECTIVE

Abstract

Reinforcement learning (RL) agents need to learn from past experiences. Prioritized experience replay, which weighs experiences by their surprise (the magnitude of the temporal-difference error), significantly improves the learning efficiency of RL algorithms. Intuitively, surprise quantifies the unexpectedness of an experience to the learning agent. But how surprise relates to the importance of experience is not well understood. To address this problem, we derive three value metrics that quantify the importance of an experience in terms of the extra reward that would be earned by accessing it. We theoretically show that these value metrics are upper-bounded by surprise for Q-learning. Furthermore, we extend our theoretical framework to maximum-entropy RL by deriving lower and upper bounds of these value metrics for soft Q-learning, which turn out to be the product of surprise and the "on-policyness" of the experiences. Our framework links two important quantities in RL, i.e., surprise and the value of experience, and provides a theoretical basis for estimating the value of experience by surprise. We empirically show that the bounds hold in practice, and that experience replay using the upper bound as priority improves maximum-entropy RL in Atari games.

1. INTRODUCTION

Learning from important experiences prevails in nature. In the rodent hippocampus, memories with higher importance, such as those associated with rewarding locations or large reward-prediction errors, are replayed more frequently (Michon et al., 2019; Roscow et al., 2019; Salvetti et al., 2014). People who replay high-reward-associated memories more frequently show better performance in memory tasks (Gruber et al., 2016; Schapiro et al., 2018). A normative theory suggests that prioritized memory access according to the utility of memory explains hippocampal replay across different memory tasks (Mattar & Daw, 2018). As accumulating new experiences is costly, utilizing valuable past experiences is key to efficient learning (Ólafsdóttir et al., 2018). Differentiating important experiences from unimportant ones also benefits reinforcement learning (RL) algorithms (Katharopoulos & Fleuret, 2018). Prioritized experience replay (PER) (Schaul et al., 2016) is an experience replay technique built on the deep Q-network (DQN) (Mnih et al., 2015), which weighs the importance of samples by their surprise, the magnitude of the temporal-difference (TD) error. As a result, experiences with larger surprise are sampled more frequently. PER significantly improves the learning efficiency of DQN, and has been adopted (Hessel et al., 2018; Horgan et al., 2018; Kapturowski et al., 2019) and extended (Daley & Amato, 2019; Pan et al., 2018; Schlegel et al., 2019) by various deep RL algorithms. Surprise quantifies the unexpectedness of an experience to a learning agent, and biologically corresponds to the reward-prediction-error signal in the dopamine system (Schultz et al., 1997; Glimcher, 2011), which directly shapes memory in animals and humans (Lisman & Grace, 2005; McNamara et al., 2014). However, how surprise is related to the importance of experience in the context of RL is not well understood.
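The surprise-based prioritization of PER described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the buffer of TD errors and the hyperparameter values are hypothetical, and the exponents alpha and beta follow the proportional-prioritization scheme of Schaul et al. (2016).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical |TD errors| of five stored transitions (illustrative values).
td_errors = np.array([0.1, 2.0, 0.5, 0.05, 1.0])

alpha, eps = 0.6, 1e-6                       # prioritization exponent, small constant
priorities = (np.abs(td_errors) + eps) ** alpha
probs = priorities / priorities.sum()        # sampling probabilities

# Transitions with larger surprise are sampled more frequently.
batch_idx = rng.choice(len(probs), size=3, p=probs)

# Importance-sampling weights correct the bias introduced by non-uniform sampling.
beta = 0.4
weights = (len(probs) * probs[batch_idx]) ** (-beta)
weights /= weights.max()                     # normalize for stability
```

Here the transition with |TD error| = 2.0 receives the highest sampling probability, while the importance-sampling weights down-weight its gradient contribution accordingly.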
We address this problem from an economic perspective, by linking surprise to the value of experience in RL. The goal of an RL agent is to maximize the expected cumulative reward, which is achieved through learning from experiences. For Q-learning, an update on an experience leads to a more accurate prediction of the action-value or a better policy, which increases the expected cumulative reward the agent may earn. We define the value of experience as the increase in the expected cumulative reward resulting from updating on the experience (Mattar & Daw, 2018). The value of experience quantifies
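The Q-learning update on a single experience, whose effect on the expected cumulative reward the value metrics quantify, can be sketched as follows. This is a toy tabular example with hypothetical states, rewards, and hyperparameters, not the paper's experimental setup.

```python
import numpy as np

gamma, lr = 0.99, 0.1           # discount factor, learning rate (illustrative)
Q = np.zeros((4, 2))            # toy table: 4 states, 2 actions

# One experience tuple (s, a, r, s'); values are hypothetical.
s, a, r, s_next = 0, 1, 1.0, 2

# Surprise is the magnitude of the TD error for this transition.
td_error = r + gamma * Q[s_next].max() - Q[s, a]
surprise = abs(td_error)

# Updating on the experience moves Q(s, a) toward the TD target,
# which can change the greedy policy and hence future return.
Q[s, a] += lr * td_error
```

In this framing, the value of the experience is the increase in expected cumulative reward caused by the `Q[s, a]` update; the paper's result is that this value is upper-bounded by the surprise `|td_error|`.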

