MEMORY-EFFICIENT REINFORCEMENT LEARNING WITH PRIORITY BASED ON SURPRISE AND ON-POLICYNESS

Anonymous

Abstract

In off-policy reinforcement learning, an agent collects transition data (a.k.a. experience tuples) from the environment and stores them in a replay buffer for subsequent parameter updates. Storing these tuples consumes a large amount of memory when the environment observations are given as images. Large memory consumption is especially problematic when reinforcement learning methods are applied in scenarios where computational resources are limited. In this paper, we introduce a method that prunes relatively unimportant experience tuples using a simple metric that estimates the importance of experiences, thereby reducing the overall memory consumed by the buffer. To measure the importance of experiences, we use surprise and on-policyness. Surprise is quantified by the information gain the model can obtain from an experience, and on-policyness ensures that retained experiences are relevant to the current policy. In our experiments, we empirically show that our method can significantly reduce the memory consumption of the replay buffer without degrading performance in vision-based environments.

1. INTRODUCTION

Reinforcement learning (RL) has become a promising approach for learning complex and intelligent behavior from visual inputs (Mnih et al., 2016; Kalashnikov et al., 2018). In particular, off-policy RL algorithms (Mnih et al., 2015; Hessel et al., 2018) generally achieve better sample efficiency than on-policy algorithms by using experience replay (Lin, 1992). In experience replay, the transitions observed in the environment are stored as experience tuples in a replay buffer and reused repeatedly. The replay buffer also serves to decorrelate the samples within a minibatch. However, these methods require a significant number of experience tuples, which consume a large amount of memory when the observations are given as images. Many prior studies on replay buffers in RL consider how experience tuples are sampled from the buffer (Schaul et al., 2016; Zha et al., 2019; Fujimoto et al., 2020; Sun et al., 2020; Oh et al., 2021). If we are to train an agent in a scenario where the available resources are limited, the replay buffer needs to be reduced to an appropriate size, but it is known that simply shrinking the buffer leads to unexpected performance degradation (Liu & Zou, 2018; Fedus et al., 2020). There is some prior work on how to select old experience tuples to overwrite when a new experience tuple arrives in a relatively small buffer (Pieters & Wiering, 2016; de Bruin et al., 2016b; 2018). However, these works do not consider a memory-efficient method for image observations, where memory consumption is large. We aim to reduce the size of the replay buffer without degrading performance in visual domains. Our intuition is that some experience tuples are important for gaining knowledge about the environment and others are not.
For example, the scenes in a video game that do not accept any inputs from the player, such as the standby screen, occupy a considerable amount of time in the game, but they do not provide much information. In contrast, frames within a few steps of the scenes where the player earns or loses points are often important in the game. In particular, scenes related to the end of a gameplay are important for keeping the game going and obtaining high scores. On the basis of this intuition, we propose to prioritize and keep experience tuples that are deemed important and discard the others. An overview of our approach is shown in Figure 1. In this paper, we propose a method to estimate the importance of experiences and to prune unnecessary experience tuples stored in the replay buffer. We hypothesize that the importance of an experience tuple is determined by the degree of information the model gains from the tuple and by the strength of its relevance to the current policy. Surprise is related to the uncertainty of an experience tuple and represents the novelty of its information. At the same time, however, the agent does not need to keep all experience tuples that have high uncertainty, especially those that are likely to be outliers. The on-policyness metric is introduced to suppress the adverse effects of such outliers and to keep the experience tuples close to the actual transitions the current agent takes. We demonstrate that our method can be implemented as a simple modification to existing replay buffer implementations. In addition, our approach can be combined with existing approaches for sampling experience tuples from the buffer. We show in the experiments that the proposed data pruning method can save the memory consumption of the buffer and prevent a performance decrease when the size of the buffer is limited.
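The pruning rule described above can be sketched as a small extension of a standard replay buffer. In this minimal sketch, the class name `PruningReplayBuffer` and the choice to combine the two scores by their product are illustrative assumptions; surprise and on-policyness are supplied as precomputed scalars rather than derived from a specific model:

```python
import numpy as np

class PruningReplayBuffer:
    """Fixed-capacity buffer that evicts the lowest-priority tuple when full.

    Priority combines surprise (e.g. an information-gain or TD-error estimate)
    with on-policyness (e.g. the current policy's probability of the stored
    action). Both estimators are placeholders passed in by the caller.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []      # list of (obs, action, reward, next_obs, done)
        self.priorities = []   # one scalar priority per stored tuple

    def add(self, transition, surprise, on_policyness):
        priority = surprise * on_policyness  # illustrative combination
        if len(self.storage) >= self.capacity:
            # Prune: overwrite the least important stored experience tuple,
            # unless the incoming tuple is itself the least important.
            victim = int(np.argmin(self.priorities))
            if priority <= self.priorities[victim]:
                return
            self.storage[victim] = transition
            self.priorities[victim] = priority
        else:
            self.storage.append(transition)
            self.priorities.append(priority)

    def sample(self, batch_size, rng=np.random):
        # Sampling stays uniform here; since pruning is independent of
        # sampling, a scheme such as PER could be plugged in instead.
        idx = rng.choice(len(self.storage), size=batch_size)
        return [self.storage[i] for i in idx]
```

Because eviction happens at insertion time, the rule composes with any sampling strategy applied afterwards.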

2. RELATED WORK

Experience replay using replay buffers is a common method in off-policy RL algorithms to make the most of the experiences obtained by the behavior policy (Mnih et al., 2015; Haarnoja et al., 2018; Hessel et al., 2018). Most research on replay buffers focuses on how experience tuples are sampled from the buffer when creating a mini-batch so as to accelerate training. The best known is prioritized experience replay (PER) (Schaul et al., 2016), which prioritizes the sampling of experience tuples based on their temporal-difference errors. Many studies have investigated other effective indicators for sampling experience tuples to achieve efficient training. Some used fixed metrics to select experience tuples (Schaul et al., 2016; Fujimoto et al., 2020; Sinha et al., 2022), while others constructed and trained models to select the tuples that maximize the improvement of the policy after the parameter update (Zha et al., 2019; Oh et al., 2021). There have also been studies examining experience selection methods based on indicators such as reward (Pieters & Wiering, 2016), surprise (de Bruin et al., 2018), and exploration (de Bruin et al., 2016a; 2018), and how they affect the performance of an RL agent in state-based environments using a relatively small buffer. Chen et al. (2021) proposed a method that aims to construct a memory-efficient RL algorithm in vision-based domains. In their method, the parameters of the convolutional neural network (CNN) encoder are frozen at an early stage of training, and the latent vectors from the encoder are stored instead of raw images, which reduces memory consumption. The major difference in our approach is that we choose which experiences to discard. Since the discarding of experiences is done independently of the sampling process, the sampling methods used in the replay buffer can be combined with our method.
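To see the scale of the memory gap that Chen et al. (2021) exploit, consider a back-of-the-envelope comparison between storing raw image observations and storing encoder latents. The concrete shapes below (84x84 frames stacked 4 deep as uint8, 50-dimensional float32 latents) are illustrative assumptions, not values taken from that paper:

```python
def tuple_bytes_raw(h=84, w=84, frames=4):
    # Each experience tuple holds two observations (obs and next_obs);
    # uint8 pixels take 1 byte each.
    return 2 * h * w * frames

def tuple_bytes_latent(dim=50):
    # Two float32 latent vectors (4 bytes per element) per tuple.
    return 2 * dim * 4

raw = tuple_bytes_raw()        # 56,448 bytes per tuple
latent = tuple_bytes_latent()  # 400 bytes per tuple
ratio = raw // latent          # latents are over 100x smaller here
```

Under these assumptions a million-tuple buffer of raw frames needs tens of gigabytes, which is exactly the regime where pruning or compression becomes necessary.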
The problem of determining which tuples to keep also arises in continual learning (Lopez-Paz & Ranzato, 2017; Shin et al., 2017; Isele & Cosgun, 2018; Rolnick et al., 2019). In these settings, the model is given a stream of data and training is done in an online manner, which makes the model susceptible to catastrophic forgetting.



Figure 1: Illustration of our proposed method. Priorities of the experience tuples are calculated, and tuples are discarded based on their priority when a new experience tuple arrives. The capacity of the replay buffer in the figure is kept small for the sake of readability.

