LEARNING TO SAMPLE WITH LOCAL AND GLOBAL CONTEXTS FROM EXPERIENCE REPLAY BUFFERS

Abstract

Experience replay, which enables agents to remember and reuse past experiences, has played a significant role in the success of off-policy reinforcement learning (RL). To utilize experience replay efficiently, existing sampling methods select more meaningful experiences by assigning priorities to them based on certain metrics (e.g., TD-error). However, they may sample highly biased, redundant transitions, since they compute the sampling rate for each transition independently, without considering its importance relative to other transitions. In this paper, we address this issue by proposing a new learning-based sampling method that computes the relative importance of each transition. To this end, we design a novel permutation-equivariant neural architecture that takes as input contexts from not only the features of each transition (local) but also those of the others (global). We validate our framework, which we refer to as Neural Experience Replay Sampler (NERS),1 on multiple benchmarks for both continuous and discrete control tasks and show that it can significantly improve the performance of various off-policy RL methods. Further analysis confirms that the improvements in sample efficiency are indeed due to NERS sampling diverse and meaningful transitions by considering both local and global contexts.

1. INTRODUCTION

Experience replay (Mnih et al., 2015), a memory that stores past experiences for reuse, has become a popular mechanism in reinforcement learning (RL), since it stabilizes training and improves sample efficiency. The success of various off-policy RL algorithms is largely attributable to the use of experience replay (Fujimoto et al., 2018; Haarnoja et al., 2018a;b; Lillicrap et al., 2016; Mnih et al., 2015). However, most off-policy RL algorithms adopt uniform random sampling (Fujimoto et al., 2018; Haarnoja et al., 2018a; Mnih et al., 2015), which treats all past experiences equally, so it is questionable whether this simple strategy always samples the most effective experiences for the agents to learn from.

Several sampling policies have been proposed to address this issue. One popular direction is to develop rule-based methods, which prioritize experiences with pre-defined metrics (Isele & Cosgun, 2018; Jaderberg et al., 2016; Novati & Koumoutsakos, 2019; Schaul et al., 2016). Notably, TD-error based sampling, which prioritizes more meaningful samples, i.e., those with high TD-error, has improved the performance of various off-policy RL algorithms (Hessel et al., 2018; Schaul et al., 2016) and is one of the most frequently used rule-based methods. Here, the TD-error measures how unexpected the returns are given the current value estimates (Schaul et al., 2016). However, such rule-based sampling strategies can lead to sampling highly biased experiences. For instance, Figure 1 shows 10 randomly selected transitions among 64 transitions sampled using certain rule-based metrics.

Motivated by the aforementioned observations, we aim to develop a method that samples both diverse and meaningful transitions. To capture both properties, it is crucial to measure the relative importance among the sampled transitions, since diversity should be considered among them, not among all transitions in the buffer.
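To make the rule-based baseline concrete, the following is a minimal sketch of proportional prioritized sampling in the style of Schaul et al. (2016): each transition is drawn with probability proportional to its |TD-error| raised to a power alpha, with importance-sampling weights correcting the induced bias. The class name and flat-array storage are illustrative simplifications (real implementations use a sum-tree for efficiency), not the paper's method.

```python
import numpy as np

class ProportionalReplay:
    """Sketch of proportional prioritized replay (Schaul et al., 2016).

    Transition i is sampled with probability p_i^alpha / sum_k p_k^alpha,
    where p_i = |TD-error_i| + eps. A flat array is used here for clarity;
    production implementations use a sum-tree for O(log N) sampling.
    """

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.buffer = []
        self.priorities = []

    def add(self, transition, td_error):
        if len(self.buffer) >= self.capacity:  # drop oldest when full
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities)
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        # Importance-sampling weights correct for non-uniform sampling;
        # normalizing by the max keeps updates bounded.
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.buffer[i] for i in idx], weights
```

Note that each transition's priority is computed independently of the others, which is exactly the limitation discussed above: two transitions with equally high TD-error are both favored even if they are nearly identical.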
To this end, we propose a novel neural sampling policy, which we refer to as the Neural Experience Replay Sampler (NERS). Our method learns to measure the relative importance among sampled transitions by extracting local and global contexts from each transition and from all sampled ones, respectively. In particular, NERS is designed to take a set of each experience's features as input and compute its outputs equivariantly with respect to permutations of the set. Here, we consider various features of a transition, such as the TD-error, the Q-value, and the raw transition itself, e.g., expecting to sample intermediate transitions (such as those in the blue boxes of Figure 1(c)) efficiently. To verify the effectiveness of NERS, we validate the experience replay with various off-policy RL algorithms, such as soft actor-critic (SAC) (Haarnoja et al., 2018a) and twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018) for continuous control tasks (Brockman et al., 2016; Todorov et al., 2012), and Rainbow (Hessel et al., 2018) for discrete control tasks (Bellemare et al., 2013). Our experimental results show that NERS consistently (and often significantly for complex tasks with high-dimensional state and action spaces) outperforms both existing rule-based (Schaul et al., 2016) and learning-based (Zha et al., 2019) sampling methods for experience replay.

In summary, our contribution is threefold:

• To the best of our knowledge, we are the first to investigate the relative importance of sampled transitions for the efficient design of experience replays.

• To this end, we design a novel permutation-equivariant neural sampling architecture that utilizes contexts from the individual (local) and the collective (global) transitions with various features to sample not only meaningful but also diverse experiences.
• We validate the effectiveness of our neural experience replay on diverse continuous and discrete control tasks with various off-policy RL algorithms, on which it consistently outperforms both existing rule-based and learning-based sampling methods.
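The local/global design described above can be illustrated with a small sketch. Below, each transition's feature vector is encoded locally, a global context is obtained by mean-pooling the local encodings over the sampled set, and the two are concatenated to score each transition. All names, layer sizes, and the single-layer structure are hypothetical simplifications for illustration, not the exact NERS architecture; the point is the equivariance property: permuting the input set permutes the output scores identically.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class EquivariantScorer:
    """Hypothetical permutation-equivariant scorer over a set of transitions.

    Each row of `feats` holds features of one sampled transition (e.g.
    TD-error, Q-value, raw transition). Local encodings are mean-pooled
    into a global context shared by all elements, so each score depends
    on both the transition itself and the whole sampled set.
    """

    def __init__(self, in_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W_local = rng.normal(scale=0.1, size=(in_dim, hidden))
        self.W_score = rng.normal(scale=0.1, size=(2 * hidden, 1))

    def __call__(self, feats):
        # feats: (n, in_dim) array for n sampled transitions.
        local = relu(feats @ self.W_local)              # (n, hidden)
        global_ctx = local.mean(axis=0, keepdims=True)  # (1, hidden), set-level
        both = np.concatenate(
            [local, np.repeat(global_ctx, len(feats), axis=0)], axis=1)
        scores = (both @ self.W_score).ravel()          # (n,)
        # Softmax converts scores into relative sampling importance.
        e = np.exp(scores - scores.max())
        return e / e.sum()
```

Because the pooling operation is symmetric, reordering the sampled transitions reorders the importance weights in exactly the same way, which is the equivariance property the architecture relies on.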



Code is available at https://github.com/youngmin0oh/NERS



Figure 1: Sampled transitions on Pendulum-v0 from various sampling strategies: (a) sampling by TD-error, (b) sampling by Q-value, (c) sampling uniformly at random. Samples highlighted in black, orange, and cyan boxes have states with the rod in downward, upright, and horizontal positions, respectively, with appropriate amounts of action. Samples in red boxes have excessively large actions.

