LEARNING TO SAMPLE WITH LOCAL AND GLOBAL CONTEXTS FROM EXPERIENCE REPLAY BUFFERS

Abstract

Experience replay, which enables agents to remember and reuse past experience, has played a significant role in the success of off-policy reinforcement learning (RL). To utilize experience replay efficiently, existing sampling methods select more meaningful experiences by imposing priorities on them based on certain metrics (e.g., TD-error). However, they may sample highly biased, redundant transitions, since they compute the sampling rate for each transition independently, without considering its importance relative to other transitions. In this paper, we address this issue by proposing a new learning-based sampling method that can compute the relative importance of each transition. To this end, we design a novel permutation-equivariant neural architecture that takes as inputs contexts from not only the features of each transition (local) but also those of others (global). We validate our framework, which we refer to as Neural Experience Replay Sampler (NERS)¹, on multiple benchmarks with both continuous and discrete control tasks, and show that it can significantly improve the performance of various off-policy RL methods. Further analysis confirms that the improvement in sample efficiency is indeed due to NERS sampling diverse and meaningful transitions by considering both local and global contexts.

1. INTRODUCTION

Experience replay (Mnih et al., 2015), a memory that stores past experiences for reuse, has become a popular mechanism for reinforcement learning (RL), since it stabilizes training and improves sample efficiency. The success of various off-policy RL algorithms is largely attributable to the use of experience replay (Fujimoto et al., 2018; Haarnoja et al., 2018a; b; Lillicrap et al., 2016; Mnih et al., 2015). However, most off-policy RL algorithms adopt uniform random sampling (Fujimoto et al., 2018; Haarnoja et al., 2018a; Mnih et al., 2015), which treats all past experiences equally, so it is questionable whether this simple strategy always samples the most effective experiences for the agents to learn from.

Several sampling policies have been proposed to address this issue. One popular direction is to develop rule-based methods, which prioritize experiences using pre-defined metrics (Isele & Cosgun, 2018; Jaderberg et al., 2016; Novati & Koumoutsakos, 2019; Schaul et al., 2016). Notably, TD-error based sampling is one of the most frequently used rule-based methods, since it has improved the performance of various off-policy RL algorithms (Hessel et al., 2018; Schaul et al., 2016) by prioritizing more meaningful samples, i.e., those with high TD-error. Here, the TD-error measures how unexpected the returns are given the current value estimates (Schaul et al., 2016). However, such rule-based sampling strategies can lead to sampling highly biased experiences. For instance, Figure 1 shows 10 randomly selected transitions among 64 transitions sampled using certain

¹ Code is available at https://github.com/youngmin0oh/NERS
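As context for the rule-based baseline discussed above, the proportional variant of TD-error prioritization (Schaul et al., 2016) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy values, and the hyperparameters (`alpha`, the small epsilon) are assumptions chosen for clarity. Note how each transition's sampling probability depends only on its own TD-error, independently of the other transitions — the limitation the paper sets out to address.

```python
import numpy as np

def td_errors(rewards, gamma, v_curr, v_next):
    """One-step TD-error: r + gamma * V(s') - V(s)."""
    return rewards + gamma * v_next - v_curr

def prioritized_sample(abs_td, batch_size, alpha=0.6, rng=None):
    """Sample indices with probability proportional to |TD-error|^alpha.

    Each priority is computed per transition, independently of the rest
    of the buffer; only the normalization couples them.
    """
    rng = np.random.default_rng(rng)
    p = (abs_td + 1e-6) ** alpha   # epsilon keeps zero-error transitions visible
    p = p / p.sum()
    return rng.choice(len(abs_td), size=batch_size, p=p)

# Toy usage with illustrative rewards and value estimates.
errs = np.abs(td_errors(np.array([1.0, 0.0, 0.5]), 0.99,
                        np.array([0.2, 0.1, 0.3]),
                        np.array([0.4, 0.0, 0.2])))
idx = prioritized_sample(errs, batch_size=2, rng=0)
```

High-error transitions dominate the batch over time, which is exactly how the biased, redundant sampling described above can arise.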

