LOOK BACK WHEN SURPRISED: STABILIZING REVERSE EXPERIENCE REPLAY FOR NEURAL APPROXIMATION

Abstract

Experience replay-based sampling techniques are essential to several reinforcement learning (RL) algorithms, since they aid convergence by breaking spurious correlations in the data. The most popular techniques, uniform experience replay (UER) and prioritized experience replay (PER), suffer from sub-optimal convergence and significant bias error, respectively. To alleviate these issues, we introduce a new experience replay method for reinforcement learning, called Introspective Experience Replay (IER). IER picks batches of data points that occur consecutively before 'surprising' points. Our proposed approach builds on the theoretically rigorous reverse experience replay (RER), which provably removes bias in the linear approximation setting but can be sub-optimal with neural approximation. We show empirically that IER is stable with neural function approximation and outperforms state-of-the-art techniques such as uniform experience replay (UER), prioritized experience replay (PER), and hindsight experience replay (HER) on the majority of tasks.

1. INTRODUCTION

Reinforcement learning (RL) involves learning from dependent data, and algorithms designed for independent data can behave poorly when coupled with the Markovian trajectories encountered in this setting. Experience replay (Lin, 1992) stores the received data points in a large buffer and produces a random sample from this buffer whenever the learning algorithm requires one. Experience replay is therefore usually deployed with popular algorithms like DQN, DDPG, and TD3 to achieve state-of-the-art performance (Mnih et al., 2015; Lillicrap et al., 2015). It has been shown experimentally (Mnih et al., 2015) and theoretically (Nagaraj et al., 2020) that these learning algorithms behave sub-optimally on Markovian data without experience replay. Throughout the paper, we use the term "sub-optimal" when we consistently observe sub-par performance compared to the oracle; in contrast, "instability" refers to settings with high variance across our experiments, where only a few seeds work well.

The simplest and most widely used experience replay method is uniform experience replay (UER), in which the data points stored in the buffer are sampled uniformly at random each time a data point is queried (Mnih et al., 2015). However, UER may pick uninformative data points most of the time, which can slow down convergence. For this reason, optimistic experience replay (OER) and prioritized experience replay (PER) (Schaul et al., 2015) were introduced, in which samples with higher TD error (i.e., 'surprise') are sampled more often from the buffer. Optimistic experience replay (originally called "greedy TD-error prioritisation") was shown to have a high bias, and prioritized experience replay was proposed to address this issue (Schaul et al., 2015). However, as our experiments outside the Atari environments show, PER still suffers from high bias.
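The two sampling schemes above can be contrasted with a minimal sketch. This is an illustration only, not the implementation used in any of the cited works: the buffer is a plain list of stand-in transitions, the function names are ours, and PER is reduced to its core proportional-prioritization rule (sampling with probability proportional to |TD error|^alpha), omitting importance-sampling corrections and the sum-tree data structure used in practice.

```python
import random

def sample_uer(buffer, batch_size):
    """Uniform experience replay: every stored transition is equally likely."""
    return random.choices(buffer, k=batch_size)

def sample_per(buffer, td_errors, batch_size, alpha=0.6):
    """Prioritized experience replay (core rule only): sample transitions
    with probability proportional to |TD error|^alpha, so 'surprising'
    points are drawn more often."""
    priorities = [abs(e) ** alpha for e in td_errors]
    return random.choices(buffer, weights=priorities, k=batch_size)

# Stand-in data: ten dummy transitions, the last one highly 'surprising'.
buffer = list(range(10))
td_errors = [0.1] * 9 + [5.0]
batch = sample_per(buffer, td_errors, batch_size=4)
```

With these priorities, index 9 dominates the sampling distribution, which illustrates how greedy prioritization concentrates on a few data points and can introduce the bias discussed above.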
Although prioritization speeds up the learning process in many cases, picking and choosing only specific data points can introduce significant biases, which can make the method sub-optimal. The design of experience replay continues to be an active field of research. Several other experience replay techniques, such as hindsight experience replay (HER) (Andrychowicz et al., 2017), reverse experience replay (RER) (Rotinov, 2019), and topological experience replay (TER) (Hong et al., 2022), have been proposed. An overview of these methods is given in Section 2.
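The batch-selection idea behind IER, as summarized in the abstract (pick batches of points occurring consecutively before 'surprising' points), can be sketched as follows. The function name, the use of top-k |TD error| as the notion of 'surprise', and the exact pivot-and-look-back construction are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ier_batches(td_errors, num_pivots, batch_size):
    """IER-style selection (sketch): choose the most 'surprising' indices
    (largest |TD error|) as pivots, then return, for each pivot, the batch
    of consecutive indices immediately preceding and including it."""
    errs = np.abs(np.asarray(td_errors, dtype=float))
    pivots = np.argsort(errs)[-num_pivots:]   # top-k surprising points
    batches = []
    for p in pivots:
        start = max(0, p - batch_size + 1)    # look back from the pivot
        batches.append(list(range(start, p + 1)))
    return batches

# Surprise spikes at indices 3 and 7; each batch looks back from a spike.
batches = ier_batches([0, 0, 0, 9, 0, 0, 0, 5, 0],
                      num_pivots=2, batch_size=3)
```

In contrast to greedy prioritization, which resamples only the surprising points themselves, this construction replays the trajectory segments leading up to them, which is the "look back when surprised" intuition in the title.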

