REPLAY MEMORY AS AN EMPIRICAL MDP: COMBINING CONSERVATIVE ESTIMATION WITH EXPERIENCE REPLAY

Abstract

Experience replay, which stores transitions in a replay memory for repeated use, plays an important role in improving sample efficiency in reinforcement learning. Existing techniques such as reweighted sampling, episodic learning, and reverse sweep updates process the information in the replay memory to make experience replay more efficient. In this work, we further exploit this information by treating the replay memory as an empirical Replay Memory MDP (RM-MDP). Solving it with dynamic programming yields a conservative value estimate that considers only transitions observed in the replay memory. We develop both value and policy regularizers based on this conservative estimate and integrate them with model-free learning algorithms. We design a memory density metric to measure the quality of the RM-MDP, and our empirical studies find a strong quantitative correlation between performance improvement and memory density. Our method, which combines Conservative Estimation with Experience Replay (CEER), improves sample efficiency by a large margin, especially when the memory density is high. Even when the memory density is low, the conservative estimate still helps to avoid suicidal actions and thereby improves performance.

1. INTRODUCTION

Improving sample efficiency is an essential challenge for deep reinforcement learning (DRL). Experience replay (Lin, 1992) stores transitions in a replay memory and reuses them multiple times. This technique plays an important role in the success of DRL algorithms such as Deep Q-Networks (DQN) (Mnih et al., 2015), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015; Haarnoja et al., 2018), and AlphaGo (Silver et al., 2016; 2017). DRL algorithms use gradient-based optimizers to update parameters incrementally, and this learning requires several orders of magnitude more training samples than a human needs for the same task. To speed up learning, many researchers therefore focus on how to better process the information in the replay memory before updating networks. One research direction is to measure the relative importance of transitions and sample them with different priorities. Criteria for measuring importance include the temporal-difference (TD) error (Schaul et al., 2016), the "age" (Fedus et al., 2020) of transitions (Zhang & Sutton, 2017; Novati & Koumoutsakos, 2019; Sun et al., 2020; Wang et al., 2020; Sinha et al., 2022), errors in target values (Kumar et al., 2020a), and coverage of the sample space (Oh et al., 2021a; Pan et al., 2022). Another direction is to exploit the structure of transitions. For example, learning can proceed in a backward manner so that sparse and delayed rewards propagate faster (Lee et al., 2019); the best episodic experiences can be memorized and used to supervise the agent during learning (Lin et al., 2018); and similar episodes can be linked together to provide more information for updates (Zhu et al., 2020; Hong et al., 2022; Jiang et al., 2022).

Can we further improve the use of the replay memory? In this work, we consider the replay memory as an empirical MDP, called the RM-MDP. Solving the RM-MDP provides us with an estimate that considers all existing transitions in the replay memory together, not just single samples or trajectories at a time.
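The core idea of solving an empirical MDP built from replay transitions can be illustrated with a small tabular sketch. This is a minimal illustration assuming hashable states and a hypothetical `solve_rm_mdp` helper, not the paper's exact construction:

```python
from collections import defaultdict

def solve_rm_mdp(transitions, gamma=0.99, n_sweeps=50):
    """Build an empirical MDP from replay transitions (s, a, r, s2, done)
    and solve it by tabular value iteration. State-action pairs absent
    from memory keep a default value of 0, making the estimate
    conservative."""
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {(s2, done): count}
    reward_sum = defaultdict(float)                 # (s, a) -> summed reward
    actions_at = defaultdict(set)                   # s -> actions observed at s
    for s, a, r, s2, done in transitions:
        counts[(s, a)][(s2, done)] += 1
        reward_sum[(s, a)] += r
        actions_at[s].add(a)

    Q = defaultdict(float)
    for _ in range(n_sweeps):
        for (s, a), succ in counts.items():
            n = sum(succ.values())
            exp_next = 0.0
            for (s2, done), c in succ.items():
                if not done:
                    # Bootstrap only over actions actually stored in memory.
                    exp_next += (c / n) * max(
                        (Q[(s2, b)] for b in actions_at[s2]), default=0.0
                    )
            Q[(s, a)] = reward_sum[(s, a)] / n + gamma * exp_next
    return Q
```

Because the backup maximizes only over actions present in the memory, states with missing actions are valued pessimistically, which is exactly the conservative behavior the text describes.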
The estimate is conservative because some actions may not be contained in the memory at all. We propose to use this conservative estimate, denoted Q̄, from the RM-MDP to regularize the learning of the Q network on the original MDP. We design two regularizers, a value regularizer and a policy regularizer. The value regularizer computes the target value by combining the estimates from Q̄ and the Q network. For the policy regularizer, we derive Boltzmann policies from Q̄ and the Q network, and constrain the distance between the two policies using the Kullback-Leibler (KL) divergence.

Our contribution is four-fold: (1) We consider the replay memory as an empirical Replay Memory MDP (RM-MDP) and obtain a conservative estimate by solving this empirical MDP. The MDP is non-stationary and is updated efficiently by sampling. (2) We design value and policy regularizers based on the Conservative Estimate, and combine them with Experience Replay (CEER) to regularize the learning of DQN. (3) We introduce memory density as a measure of the quality of the RM-MDP, and empirically find a strong correlation between performance improvement and memory density. This relationship gives a clear indication of when our method will help. (4) Experiments on Sokoban (Schrader, 2018), MiniGrid (Chevalier-Boisvert et al., 2018) and MinAtar (Young & Tian, 2019) environments show that our method improves sample efficiency by a large margin, especially when the memory density is high. We also show that even when the memory density is low, the conservative estimate can help to avoid suicidal actions and still benefit learning, and that our method is effective in environments with sparse and delayed rewards.
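The two regularizers can be made concrete with a short numpy sketch. The mixing weight `lam` and temperature `tau` below are hypothetical knobs introduced for illustration; the paper's exact formulation may differ:

```python
import numpy as np

def value_regularized_target(r, q_next_net, q_next_cons, gamma=0.99, lam=0.5):
    """Value regularizer: a bootstrap target mixing the Q network's
    next-state estimate with the conservative RM-MDP estimate."""
    return r + gamma * ((1 - lam) * np.max(q_next_net) + lam * np.max(q_next_cons))

def boltzmann(q, tau=1.0):
    """Boltzmann (softmax) policy over a vector of action values."""
    z = np.exp((q - np.max(q)) / tau)
    return z / z.sum()

def policy_kl(q_net, q_cons, tau=1.0):
    """Policy regularizer: KL divergence between the Boltzmann policies
    induced by the conservative estimate and by the Q network."""
    p, q = boltzmann(q_cons, tau), boltzmann(q_net, tau)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

Penalizing `policy_kl` during training pulls the network's policy toward the conservative one, while `value_regularized_target` injects the conservative estimate directly into the TD target.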

2. RELATED WORK

Reweighted Experience Replay. Since DQN (Mnih et al., 2015), which replays experiences uniformly to update the Q network, achieved human-level performance on Atari games, improving experience replay (Lin, 1992) has become an active direction. Prioritized Experience Replay (PER) (Schaul et al., 2016) is the first and best-known improvement: it replays important transitions, as measured by TD errors, more frequently. Oh et al. (2021b) additionally learn an environment model and prioritize experiences with both high model estimation errors and high TD errors. Saglam et al. (2022) show that for actor-critic algorithms in continuous control, actor networks should be trained with low-TD-error transitions. Sampling recent transitions is another popular criterion (Novati & Koumoutsakos, 2019; Sun et al., 2020; Wang et al., 2020; Cicek et al., 2021); these authors argue that old transitions hurt the update because the current policy differs substantially from previous ones. Distribution Correction (DisCor) (Kumar et al., 2020a) reweights transitions inversely proportional to the estimated errors in target values. For these rule-based sampling strategies, Fujimoto et al. (2020) show that non-uniform sampling is equivalent to a corrected loss function, which provides a new direction for experience replay. Learning a replay policy (Zha et al., 2019; Oh et al., 2021a) is also a feasible approach; these methods alternately learn a replay policy and a target policy.

Experience Replay with Episodic RL. Episodic RL (Blundell et al., 2016; Pritzel et al., 2017; Hansen et al., 2018) uses a separate lookup table to memorize the best experiences ever encountered. Episodic Memory Deep Q-Networks (EMDQN) (Lin et al., 2018) uses the maximum return from episodic memory as the target value; this supervised signal compensates for the slow learning that results from single-step reward updates.
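The episodic-memory table behind EMDQN-style methods can be illustrated with a small tabular sketch. This assumes hashable states and a hypothetical `update_episodic_memory` helper; it is not EMDQN's actual implementation:

```python
def update_episodic_memory(memory, episode, gamma=0.99):
    """Record the best discounted return ever observed for each (s, a).
    `episode` is a list of (s, a, r) tuples in time order; `memory`
    maps (s, a) -> best return seen so far. Returns are accumulated
    backward so each step's return includes all later rewards."""
    g = 0.0
    for s, a, r in reversed(episode):
        g = r + gamma * g
        if g > memory.get((s, a), float("-inf")):
            memory[(s, a)] = g
    return memory
```

A value stored this way can then serve as an additional (lower-bound) supervision target alongside the usual one-step TD target.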
Episodic Reinforcement Learning with Associative Memory (ERLAM) (Zhu et al., 2020) builds a graph on top of state transitions and uses it as early guidance. The RM-MDP construction in our work is somewhat similar, but more general: we construct the MDP and solve it through a modified dynamic programming scheme, and we place no constraints on the MDP, so it can be stochastic and can contain cycles.

Experience Replay with Reverse Update. Dai & Hansen (2007) and Goyal et al. (2019) focus on the sequential character of trajectories. They argue that a state's correct Q value is preconditioned on accurate Q values of its successor states. Episodic Backward Update (EBU) (Lee et al., 2019) samples a whole episode and successively propagates the value of each state to its predecessor states. Topological Experience Replay (TER) (Hong et al., 2022) builds a graph to remember all predecessors of each state and performs updates from terminal states backward via breadth-first search. Our method propagates rewards while solving the RM-MDP, so the sequential character of trajectories is taken into account by the regularizers.
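The backward propagation used by EBU-style reverse updates can be sketched in tabular form. The `backward_episode_update` helper and its `actions`/`alpha` arguments are illustrative assumptions, not the published algorithm:

```python
def backward_episode_update(Q, episode, actions, alpha=1.0, gamma=0.99):
    """Sweep one episode from its terminal transition back to the start,
    so a delayed terminal reward reaches early states in a single pass.
    `episode` is a list of (s, a, r, s2, done) tuples; Q is a dict
    mapping (s, a) -> value."""
    for s, a, r, s2, done in reversed(episode):
        # Successor values are updated first, so the bootstrap is fresh.
        bootstrap = 0.0 if done else gamma * max(Q.get((s2, b), 0.0) for b in actions)
        target = r + bootstrap
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + alpha * (target - q_sa)
    return Q
```

With forward-ordered updates the same reward would need one pass per step of the episode to reach the initial state; the reverse sweep delivers it in one.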

* Work done while an intern at Huawei Noah's Ark Lab.

