MEMORY-EFFICIENT REINFORCEMENT LEARNING WITH PRIORITY BASED ON SURPRISE AND ON-POLICYNESS

Anonymous

Abstract

In off-policy reinforcement learning, an agent collects transition data (a.k.a. experience tuples) from the environment and stores them in a replay buffer for subsequent parameter updates. Storing these tuples consumes a large amount of memory when the environment observations are given as images. Large memory consumption is especially problematic when reinforcement learning methods are applied in scenarios where computational resources are limited. In this paper, we introduce a method that prunes relatively unimportant experience tuples using a simple metric that estimates the importance of experiences, reducing the overall memory consumption of the buffer. To measure the importance of experiences, we use surprise and on-policyness. Surprise is quantified by the information gain the model can obtain from an experience, and on-policyness ensures that retained experiences are relevant to the current policy. In our experiments, we empirically show that our method can significantly reduce the memory consumption of the replay buffer without decreasing performance in vision-based environments.

1. INTRODUCTION

Reinforcement learning (RL) has become a promising approach for learning complex and intelligent behavior from visual inputs (Mnih et al., 2016; Kalashnikov et al., 2018). In particular, off-policy RL algorithms (Mnih et al., 2015; Hessel et al., 2018) generally achieve better sample efficiency than on-policy algorithms by using experience replay (Lin, 1992). In experience replay, the transitions observed in the environment are stored as experience tuples in a replay buffer and used repeatedly. In addition, the replay buffer serves to remove the correlations between the samples in a minibatch. However, these methods require a significant number of experience tuples, which consume a large amount of memory when the observations are given as images. Many prior studies on replay buffers in RL consider how the experience tuples are sampled from the buffer (Schaul et al., 2016; Zha et al., 2019; Fujimoto et al., 2020; Sun et al., 2020; Oh et al., 2021). If we are to train an agent in a scenario where the available resources are limited, the replay buffer needs to be reduced to an appropriate size. However, it is known that simply reducing the size of the replay buffer leads to unexpected performance degradation (Liu & Zou, 2018; Fedus et al., 2020). There is some prior work on how to select old experience tuples to overwrite when a new experience tuple arrives at a relatively small buffer (Pieters & Wiering, 2016; de Bruin et al., 2016b; 2018), but it does not consider a memory-efficient method for image observations, where memory consumption is large. We aim to reduce the size of the replay buffer without degrading performance in visual domains. Our intuition is that some experience tuples are important for gaining knowledge about the environment and others are not.
For example, the scenes in a video game that do not accept any inputs from the player, such as the standby screen, occupy a considerable amount of time in the game but do not provide much information. In contrast, the frames within a few frames of the scenes where the player earns or loses points are often important in the game. In particular, the scenes related to the end of a gameplay are important for keeping the game going and obtaining high scores. On the basis of this intuition, we propose to prioritize and keep experience tuples that are deemed important and discard the others. An overview of our approach is shown in Figure 1.

Figure 1: Illustration of our proposed method. Priorities of the experience tuples are calculated, and tuples are discarded based on their priority when a new experience tuple arrives. The capacity of the replay buffer in the figure is small for the sake of readability.

In this paper, we propose a metric based on surprise (Itti & Baldi, 2005) and on-policyness (Fedus et al., 2020) to estimate the importance of experiences and to prune unnecessary experience tuples stored in the replay buffer. We hypothesize that the importance of an experience tuple is determined by the degree of information that the model gains by obtaining it and the strength of its relevance to the current policy. Surprise is related to the uncertainty of an experience tuple and represents the novelty of its information. At the same time, however, the agent does not need to keep all experience tuples that have high uncertainty, especially ones that are likely to be outliers. The on-policyness metric is introduced to suppress the adverse effects of those outliers and to keep the experience tuples close to the actual transitions the current agent takes. We demonstrate that our method can be implemented as a simple modification to existing replay buffer implementations.
In addition, our approach can be combined with the existing approaches that are used to sample experience tuples from the buffer. We show in the experiments that the proposed method of data pruning can save the memory consumption of the buffer and prevent a performance decrease when the size of the buffer is limited.

2. RELATED WORK

Experience replay using replay buffers is a common method in off-policy RL algorithms to make the most of the experiences obtained by the behavior policy (Mnih et al., 2015; Haarnoja et al., 2018; Hessel et al., 2018). Most of the research on replay buffers focuses on how experience tuples are sampled from the buffer when creating a minibatch, to accelerate the training process. The most well-known method is prioritized experience replay (PER) (Schaul et al., 2016), which prioritizes the sampling of experience tuples based on their TD errors. Many studies have investigated other effective indicators for sampling experience tuples to achieve efficient training. Some used fixed metrics to select experience tuples (Schaul et al., 2016; Fujimoto et al., 2020; Sinha et al., 2022), while others constructed and trained models to select the tuples that maximize the improvement of the policy model after the parameter update (Zha et al., 2019; Oh et al., 2021). There have also been studies examining experience selection methods based on indicators such as reward (Pieters & Wiering, 2016), surprise (de Bruin et al., 2018), and exploration (de Bruin et al., 2016a; 2018), and how they affect the performance of an RL agent in state-based environments with a relatively small buffer. Chen et al. (2021) proposed a method that aims to construct a memory-efficient RL algorithm in vision-based domains: they freeze the parameters of the convolutional neural network (CNN) encoder at an early stage of training and store the latent vectors from the encoder instead of raw images, saving memory. The major difference in our approach is that we choose which experiences to discard. Since the discarding of experiences is done independently of the sampling process, the sampling methods used in the replay buffer can be combined with our method.
The problem of determining which tuples to keep also arises in continual learning (Lopez-Paz & Ranzato, 2017; Shin et al., 2017; Isele & Cosgun, 2018; Rolnick et al., 2019). In these settings, the model is given a stream of data and training is done in an online manner. Such settings suffer from catastrophic forgetting (French, 1999; Goodfellow et al., 2013), where the model forgets information observed in the early stages of training. One method to prevent this issue is to keep some of the data in a buffer to remember the key components of each domain. In this approach, metrics such as the gradient of the samples (Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019), the feature vectors (Rebuffi et al., 2017), or the learnability of the data (Sun et al., 2022) are used to select the data to keep. In continual learning in RL, tasks from different domains are given as training proceeds (Isele & Cosgun, 2018; Rolnick et al., 2019): the agents need to remember previous tasks after being introduced to new ones, or have to adapt quickly using knowledge from the previous tasks. In our setting, we consider only a single task domain and do not require adaptation to a different task, so catastrophic forgetting is not as central as it is in these methods.

3. BACKGROUND

We consider a Markov decision process (MDP), defined by $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \rho_0, r, \gamma)$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is a set of actions, $\mathcal{T}$ is a state transition function, $\rho_0$ is a distribution over the initial states, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, and $\gamma \in [0, 1)$ is a discount factor. At each time step $t$, the agent observes the current state $s_t$ of the environment and selects and takes an action $a_t$ based on its policy $\pi(a_t|s_t)$. The environment returns an immediate reward $r_t$ and the next state $s_{t+1}$. The goal of the agent is to learn a stationary policy that maximizes the expected discounted sum of rewards $R = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where the expectation is taken over trajectories sampled with $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim \mathcal{T}(s_t, a_t)$ for $t \geq 0$. An experience tuple $\tau_t = (s_t, a_t, r_t, s_{t+1})$ contains the elements observed within a single time step in the environment.
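As a quick illustrative sketch (the function name is ours, not part of any RL library), the discounted return for a finite reward sequence can be computed by accumulating backwards:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return R = sum_t gamma^t * r_t for a finite reward list,
    computed via the recursion R_t = r_t + gamma * R_{t+1}."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```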

3.1. DEEP REINFORCEMENT LEARNING

The mainstream of off-policy RL algorithms derives from Q-learning (Watkins & Dayan, 1992), which learns an action-value function called a Q-function. The Q-function is defined as the expected discounted sum of rewards after taking an action $a$ in a state $s$: $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The Q-function is learned by minimizing the temporal difference (TD) error $\delta(\tau)$, defined as $\delta(\tau) = (Q(s_t, a_t) - Q_{\text{target}}(s_t, a_t))^2$, where $Q_{\text{target}}(s_t, a_t) = r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a)$. The deep Q-network (DQN) algorithm (Mnih et al., 2015) approximates the Q-function with a deep neural network. The Q-function has been expressed and learned in various forms, for example, as a combination of a value function and an advantage function (Wang et al., 2016), or as a function returning the distribution of the Q-values (Bellemare et al., 2017; Dabney et al., 2018). In distributional settings, Q-values are represented as a discrete probability distribution $p(Q)$, and the TD error is commonly calculated as the KL divergence between the target and current Q-value distributions (Bellemare et al., 2017): $\delta(\tau) = D_{\mathrm{KL}}(p(Q_{\text{target}}(s_t, a_t)) \,\|\, p(Q(s_t, a_t)))$.
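The scalar TD error above can be sketched as follows (a minimal illustration with hypothetical function names, not a DQN implementation):

```python
def td_error(q_sa, max_q_next, reward, gamma=0.99):
    """Squared TD error delta(tau) = (Q(s,a) - Q_target(s,a))^2,
    with Q_target(s,a) = r + gamma * max_a' Q(s', a')."""
    q_target = reward + gamma * max_q_next
    return (q_sa - q_target) ** 2

print(td_error(q_sa=1.0, max_q_next=2.0, reward=0.5, gamma=0.5))  # 0.25
```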

3.2. EXPERIENCE REPLAY

In an online setting, the collected experience tuples are discarded once they are used for updating the parameters. However, it is difficult for the model to learn from a single update, so discarding experience tuples that are rarely encountered can lead to insufficient training. Experience replay (Lin, 1992) uses a replay buffer to store experience tuples, usually implemented as a ring buffer holding a fixed number of the latest experience tuples collected by the behavior policy. When the buffer is full, the oldest experience tuple is discarded and the new tuple is appended. By using the experience tuples repeatedly, experience replay improves the sample efficiency of the training algorithm. The simplest and most common way of creating a batch is uniform sampling from the buffer. The replay buffer also plays a role in suppressing the temporal correlation of online samples and preventing catastrophic forgetting.

Algorithm 1 Proposed method with Q-learning
1: Initialize action-value network Q with parameters θ
2: Initialize replay buffer D with capacity C
3: for each time step t do
4:   Select and execute action a_t ← arg max_a Q_θ(s_t, a)
5:   Receive reward r_t and next state s_{t+1}
6:   Calculate the pruning priority f(τ_t) of the experience tuple τ_t
7:   if t ≥ C then
8:     (τ_min, f(τ_min)) ← arg min_{(τ, f(τ)) ∈ D} f(τ)
9:     if f(τ_min) < f(τ_t) then
10:      D ← D \ {(τ_min, f(τ_min))}
11:      Store the experience tuple (τ_t, f(τ_t)) into the replay buffer D
12:  else
13:    Store the experience tuple (τ_t, f(τ_t)) into the replay buffer D
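For reference, the standard FIFO replay buffer described in Section 3.2, which Algorithm 1 modifies, can be sketched as follows (a minimal illustration with names of our choosing, not the paper's implementation):

```python
import random
from collections import deque

class FIFOReplayBuffer:
    """Standard ring-buffer replay: when full, the oldest tuple is evicted."""
    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest element on overflow.
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        # transition is an experience tuple (s, a, r, s_next).
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling, the simplest and most common batching scheme.
        return random.sample(list(self.buffer), batch_size)
```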

4. METHOD

In this section, we present our method for prioritizing experience tuples by their importance and determining which tuples to retain according to the calculated priority values. This ensures that the number of tuples fits the capacity of a limited-size replay buffer. Our method saves the required memory, and important experience tuples are sampled more frequently, which can accelerate training. We define the priority function $f$ for an experience tuple $\tau_t$ as $f(\tau_t) = w(s_t, a_t|Q)\,\delta(\tau_t)$, where $\delta(\tau_t)$ is a value based on surprise and $w(s_t, a_t|Q)$ is a weight based on the current agent's policy, calculated using the Q-function. The details of the value and the weight are described in the following sections.

4.1. SURPRISE

Surprise is a measure of how unexpected an experience tuple is for the model. Surprise-based scores have been used for exploration in reinforcement learning to efficiently find states that are not fully known to the model (Schaul et al., 2016; Burda et al., 2019; Berseth et al., 2021). Prioritizing experience tuples that have high uncertainty and sampling from the weighted distribution has been shown to accelerate training in deep RL (Schaul et al., 2016; Hessel et al., 2018). TD errors are usually used as the uncertainty measure: experience tuples with larger errors are more unexpected for the model and can have more room left to learn. In distributional Q-learning, TD errors can be viewed as an approximation of the information gain that can be obtained from the experience tuple $\tau$: $I(Q; \tau) = \mathbb{E}_\tau[D_{\mathrm{KL}}(p(Q|\tau) \,\|\, p(Q))] \simeq \mathbb{E}_\tau[D_{\mathrm{KL}}(p(Q_{\text{target}}) \,\|\, p(Q))] = \mathbb{E}_\tau[\delta(\tau)]$. In the approximation, we use the assumption that the distribution of the Q-value given the experience tuple becomes close to the target Q-value distribution, because the Q-function is trained to minimize the difference from the target Q-function using the experience tuple $\tau$. Selecting tuples with a large TD error therefore results in choosing tuples with large information gain.

Figure 2: The best evaluation performance difference between the proposed method and the baseline method in settings where the replay buffer size is constrained to 10k. The results are normalized by the performance of the unconstrained baseline method.
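As a sketch of the distributional surprise term, the KL divergence between two categorical Q-value distributions over a fixed support can be computed as follows (illustrative code, not the paper's implementation; the eps guard is our numerical safeguard):

```python
import math

def kl_divergence(p_target, p_current, eps=1e-12):
    """D_KL(p(Q_target) || p(Q)) between two categorical distributions
    over the same fixed support of Q-value atoms."""
    return sum(pt * math.log((pt + eps) / (pc + eps))
               for pt, pc in zip(p_target, p_current))

# Identical distributions carry no surprise; a shifted one does.
p = [0.1, 0.2, 0.7]
q = [0.7, 0.2, 0.1]
print(kl_divergence(p, p))  # 0.0
print(kl_divergence(p, q))  # > 0
```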

4.2. ON-POLICYNESS WEIGHT

Selecting experience tuples only by large errors results in unstable training, because tuples that are far from the state and action distribution of the current policy tend to have large errors, since they are seen infrequently. To overcome this issue, we additionally introduce on-policyness (Fedus et al., 2020) into the priority metric. Here, on-policyness is defined as how well an experience tuple stored in the buffer reflects the current target policy. On-policyness is important in on-policy RL algorithms, where the behavior policy and the target policy need to be the same; a common way to satisfy this condition is to constrain the target policy to stay close to the behavior policy (Schulman et al., 2015; 2017). The on-policyness of an experience tuple is also important in off-policy training (Zhang & Sutton, 2017; Hausknecht & Stone, 2016; Novati & Koumoutsakos, 2019). To reflect the on-policyness of the experience tuples in the priority values, we propose a weight function in addition to the TD error: $w(s_t, a_t|Q) = \frac{\exp(Q(s_t, a_t))}{\sum_a \exp(Q(s_t, a))}$. This weight function reflects how likely the model is to take action $a_t$ at state $s_t$ and can be viewed as a soft Q-policy (Haarnoja et al., 2017). The weight is always set to 1 when the priority is calculated for the first time, to ensure that every transition is used at least once. We initially considered using a hard policy for measuring on-policyness: the weight function with a hard policy returns 1 if the action $a$ in the experience tuple matches the action that maximizes the Q-value at the state $s$, and returns a small value $\epsilon$ otherwise, similar to the action probability distribution of an epsilon-greedy policy. However, soft weights performed slightly better in initial experiments, so we did not use the hard policy for evaluation.
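A minimal sketch of this softmax weight (illustrative function name; the max-subtraction is our standard numerical stabilization, which leaves the softmax unchanged):

```python
import math

def on_policyness_weight(q_values, action):
    """w(s,a|Q): softmax over the Q-values at a state, evaluated at the
    stored action. q_values is the list of Q(s, a) over all actions."""
    m = max(q_values)  # subtract max before exponentiating for stability
    exps = [math.exp(q - m) for q in q_values]
    return exps[action] / sum(exps)

# The greedy action receives the largest weight.
print(on_policyness_weight([1.0, 3.0, 0.5], action=1))
```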

4.3. OVERALL ALGORITHM

The overall algorithm is shown in Algorithm 1. The priority of each experience tuple is calculated with the function explained above: a pruning priority value $f(\tau_t)$ is computed after the experience $\tau_t$ is collected from the environment, and the tuple is stored in the buffer together with its priority as $(\tau_t, f(\tau_t))$. The priority values of tuples are updated only when they are sampled to create a minibatch. Experience selection is executed when the buffer is full and a new experience tuple arrives: the tuple with the minimum priority value is discarded. These priority values are calculated and used independently of the ones used in PER. We also considered selecting tuples stochastically instead of deterministically, but found it difficult to adjust the hyperparameters to provide appropriate weighting for selecting the tuples.
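A minimal sketch of this minimum-priority eviction rule (hypothetical class and method names; priorities are supplied by the caller, e.g. as $f(\tau) = w \cdot \delta$):

```python
class PruningReplayBuffer:
    """Replay buffer that evicts the lowest-priority tuple when full,
    as in Algorithm 1. A new tuple is kept only if its priority exceeds
    the current minimum in the buffer."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []  # list of (priority, transition) pairs

    def add(self, transition, priority):
        if len(self.data) < self.capacity:
            self.data.append((priority, transition))
            return
        # Find the stored tuple with the minimum priority (O(n) scan).
        i_min = min(range(len(self.data)), key=lambda i: self.data[i][0])
        # Replace it only if the incoming tuple beats the minimum.
        if self.data[i_min][0] < priority:
            self.data[i_min] = (priority, transition)
```

For very large buffers a heap or sum-tree would make the minimum lookup cheaper, but the linear scan matches the arg-min in Algorithm 1 directly.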

5. EXPERIMENTS

5.1. EXPERIMENTAL SETTINGS

We conducted experiments on 52 Atari (Bellemare et al., 2013) games to investigate the efficacy of our method. Atari is a set of commonly used benchmark environments for visual discrete control and contains 57 games in total; we used 52 of them. We excluded five games from the evaluation because they only provide sparse rewards or require a significant amount of exploration. Those environments are challenging even for the baseline algorithm without any constraint, and since our method does not promote exploration before the baseline algorithm receives an initial reward signal from the environment, we considered them not appropriate for our comparison purposes. We trained our agent using the Rainbow (Hessel et al., 2018) algorithm for 1M training steps with the buffer size constrained to 10k. We ran experiments comparing our method with the following experience selection approaches.
Baseline: discards experience tuples in a first-in-first-out (FIFO) manner.
Surprise+S (de Bruin et al., 2018): chooses the experience tuples to discard stochastically, using the same sampling method as PER with a sampling priority based on the inverse of the TD error δ(τ). This makes tuples with small TD errors more likely to be discarded.
Exploration+S (de Bruin et al., 2018): keeps the experience tuples with rare actions. The original implementation used action probabilities, which cannot be directly used for Rainbow training; instead, we used a sampling priority based on soft-policy metrics and sampled the tuples stochastically in the same way as PER. This makes tuples with rare actions stay longer in the buffer.
In addition to these experience selection methods, we added results using SEER (Chen et al., 2021), a data compression method. We also experimented with a setting where no constraint is placed on the buffer size in the baseline method to show the upper bound of the performance.
The details of the model architecture and hyperparameters are shown in Appendix A. We ran experiments with five random seeds for each environment. The performance of the agent on each seed is evaluated as the mean cumulative reward of five evaluation episodes every 10k training steps.

5.2. RESULTS

We summarize the performance in each Atari environment in Figure 2. Our experience selection method suppresses the degradation of agent performance when the size of the replay buffer is limited: in 39 out of the 52 environments, the proposed method showed improvement over the baseline results. Figure 3 shows a comparison with the other experience selection methods. The proposed method outperformed the prior experience selection methods and achieved results comparable to the unconstrained baseline in most environments. In some environments, such as RoadRunner, the proposed method even accelerated training compared to the unconstrained baseline. The proposed method showed low performance in environments where the methods that put more weight on exploration scored better. We provide all the learning curves and the best performance scores of the baseline and the proposed method for the Atari environments in Appendix F and Appendix H.

5.3.1. COMPONENT ANALYSIS

We also experimented on six Atari environments with only one component of the priority metric, i.e., surprise or on-policyness, to investigate the contribution of each factor. The results, shown in Figure 4, indicate that both components contributed to the final performance. Overall, the performance decrease was larger when the on-policyness metric was removed from the priority function. For four of the six environments, Atlantis, BattleZone, Defender, and RoadRunner, the on-policyness metric alone was sufficient to reach the final results. In Enduro, removing the surprise factor slowed down learning, even though using surprise alone yielded only a relatively small improvement in performance.

5.3.2. STATE DISTRIBUTIONS

We plotted the distributions of the latent state representations with t-SNE (Van der Maaten & Hinton, 2008) to analyze the state distributions of the experience tuples stored in the replay buffer, coloring each point by the reward obtained during the transition. We used the replay buffers from the unconstrained baseline, the constrained baseline, and the constrained proposed method, the same runs as those shown in Section 5.2. Snapshots of the replay buffer were taken every 100k training steps, and all the transitions in the buffer were used for the constrained settings. For the unconstrained setting, we used the states observed between training steps 100k·i + 90k and 100k·(i + 1) for i = 0, 1, ..., 9. We converted the image observations into latent representations using the CNN encoder module from the unconstrained baseline model after 1M training steps; thus, the latent representations of observations from different experiment settings share the same latent space in the plotted results. The results are shown in Figure 5. The state transitions that the proposed method keeps tend to have larger returns than those kept by the baseline methods. In Figure 5a, the proposed method shows relatively higher coverage of the state space than the baseline with a buffer size of 10k throughout the training. The state distribution at the final training step in Figure 5b shows that the states are densely distributed in regions where the constrained baseline has relatively few samples.

5.3.3. LONGER SETTING

We conducted experiments to examine whether our method has an advantage in standard off-policy training, using the experience selection metrics with the buffer size set to 1M, a common size for the replay buffer in off-policy training (Mnih et al., 2015; Schaul et al., 2016; Hessel et al., 2018). Since our experience selection is performed only once there is no space left in the buffer, we trained the agents for 10M steps in each environment. We used six Atari environments in total for this experiment: five where the proposed method performed well, and one where it did not show much improvement. These environments share the property that a smaller buffer size reduces the performance of the agents. The results are shown in Figure 6. We see a significant drop in performance for BattleZone and Qbert after around 1M training steps, which suggests potential negative effects of our method over a long training period; we discuss this in detail in Section 6. The proposed method and the baseline performed almost equally well for CrazyClimber and RoadRunner. In Enduro and Defender, the proposed method showed some positive effects: the former showed some acceleration in learning, while the latter showed a smaller decline in performance.

6. LIMITATION AND DISCUSSION

In this paper, we proposed a method to select experience tuples during training. The method successfully prevents performance degradation when the capacity of the replay buffer is limited. The surprise term encourages the buffer to store experience tuples with large uncertainty, and on-policyness ensures that the tuples are tied to the current policy of the agent. One limitation of our work is the low performance of the agent in long-term training. Although using TD errors as an indicator of learnability is reasonable to some extent, it can have negative effects when the agent is trained for a large number of time steps. For example, as training proceeds and the overall loss decreases, the influence of states with large loss but little novelty increases: the initial states of episodes can have high TD errors because the return of an episode can vary between trials. Although the on-policyness metric can reduce the effect of out-of-distribution experiences, the agent can easily fall into local optima when the diversity of the data is lost. As a result, the combination of these two metrics can make the state distribution converge to a small area of the state space, leaving the transitions in the replay buffer highly biased. A simple example of this case is shown in Appendix B. One potential solution may be to ensure, to some extent, the diversity of the states and actions stored in the replay buffer.

7. CONCLUSION

We proposed a technique to save the memory consumption of a replay buffer by using estimated importance of the experiences. Our technique is based on the idea that the importance of experiences can be estimated using the TD-errors and the closeness to the decision made by the current target policy. Our experiments demonstrated that using our method of experience retention can reduce the negative effects of a limited-size buffer. We expect that our proposed method makes RL training more applicable to low-resource environments.

REPRODUCIBILITY STATEMENT

We provide detailed descriptions of our experimental setups and implementations in Section 5.1 and Appendix A. We also provide the source code in the supplementary materials.

A TRAINING DETAILS

The model architecture we used is based on the PFRL (Fujita et al., 2021) implementation of the Rainbow algorithm. The length of each episode is 108k frames at maximum. The list of hyperparameters is shown in Table 1. For the experiments using SEER (Chen et al., 2021), we froze the CNN encoder at 100k training steps. The initial size of the replay buffer is 10k, and its capacity is increased by a factor of 2.25 after the CNN encoder is frozen and the observations are stored as latent vectors. This is exactly the gain the replay buffer obtains when 4x84x84 8-bit integer image observations are converted to 3136-dimensional 32-bit float vectors.
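The factor of 2.25 follows directly from the byte counts stated above:

```python
# Bytes per stored observation, as described in Appendix A:
# a stack of four 84x84 8-bit frames vs. a 3136-dim float32 latent vector.
image_bytes = 4 * 84 * 84 * 1  # uint8: 1 byte per pixel -> 28224 bytes
latent_bytes = 3136 * 4        # float32: 4 bytes per element -> 12544 bytes
print(image_bytes / latent_bytes)  # 2.25
```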

B ANALYSIS OF FAILURE CASE IN MOUNTAIN CAR

We show some possible failure cases of the proposed method in MountainCar-v0. In successful cases, the state-action distribution of the tuples in the replay buffer forms the circular curves shown in Figure 8. The curve starts from the center and ends at the top right, corresponding to the bottom of the valley and the top of the hill in Figure 7. We analyzed the state-action distributions of tuples at the early and late stages of training in failure cases using the surprise and on-policyness metrics. At the early stage of training, surprise encouraged the tuples to spread throughout the state space (Figure 9a). However, after the model is trained to some extent, the tuples start to gather around the initial position of the cart (Figure 10a). This is because, at a certain point in training, the TD errors from novel states become smaller than those at the initial states, where the variance of Q-values remains relatively high throughout training. We also found that the tuples selected by the on-policyness metric can become biased, especially toward tuples with the action "right" (Figure 9b). This is because taking the action "right" at the top of the hill tends to have relatively high Q-values compared with other actions, which makes the on-policyness weight large. This also resulted in the transitions taken near the goal early in training remaining in the buffer until the late stages (Figure 10b). In addition, the tuples collected by the on-policyness metric could form a cluster when the Q-value of a certain action became significantly higher than the others. As a result, the combination of surprise and on-policyness can converge to a small area of the state space with little action diversity, making the policy collapse (Figure 10c).

C APPLICATION TO DOMAINS WITH CONTINUOUS ACTION SPACE

The surprise term can be used with most off-policy algorithms. On the other hand, the on-policyness metric, which requires calculating softmax values across the possible actions at a state $s$, cannot be directly used in domains with a continuous action space. One possible approach is to use the action probability calculated in actor-critic methods (Lillicrap et al., 2016; Haarnoja et al., 2018), since the on-policyness metric resembles a soft Q-policy. In this case, assuming the action probability distribution is a normal distribution, the on-policyness weight can be written as $w(s_t, a_t) = \mathcal{N}(a_t \mid \pi(s_t), \sigma)$. The σ can be either the one used in the current target policy or a fixed value; we observed that avoiding small values of σ performed slightly better, and we set σ to 1 throughout training. We evaluated this simple modification on the DeepMind Control Suite (Tunyasuvunakool et al., 2020), using DrQ-v2 (Yarats et al., 2021) as the RL algorithm and constraining the replay buffer size to 2k, the minimum size that does not change the original exploration steps. All hyperparameters except the replay buffer size were kept the same as in the original implementation. The results are shown in Figure 11. We saw improvements over the baseline results on most of the domains we tested.
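Under the stated assumption of a 1-D Gaussian action distribution, the continuous-action weight can be sketched as follows (illustrative function name, not the DrQ-v2 codebase):

```python
import math

def gaussian_on_policyness(action, policy_mean, sigma=1.0):
    """w(s_t, a_t) = N(a_t | pi(s_t), sigma): Gaussian density of the stored
    action under the current policy's mean, for a scalar action."""
    z = (action - policy_mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# The weight peaks when the stored action matches the policy's mean action.
print(gaussian_on_policyness(0.0, 0.0))  # ~0.3989, i.e. 1/sqrt(2*pi)
```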

D DETERMINISTIC SELECTION VS STOCHASTIC SELECTION

Our proposed method deterministically selects the tuples to discard by selecting the tuple with the minimum priority value. However, this process can be modified to a stochastic procedure by using the inverse of the priority value as the sampling priority. The results of the comparisons are shown in Figure 12 . Applying stochastic sampling to the proposed method caused a decrease in the performance in most of the environments. However, there were some environments where the stochastic sampling improved the performance. Those were the environments where the deterministic proposed method performed worse than the baseline. In addition, stochastic sampling had some effect on reducing the loss throughout the training. 
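The stochastic variant can be sketched as follows (illustrative function name; the eps guard, our addition, avoids division by zero for zero-priority tuples):

```python
import random

def sample_eviction_index(priorities, eps=1e-8):
    """Pick the index of the tuple to discard stochastically, with
    probability proportional to the inverse of its pruning priority,
    so low-priority tuples are the most likely to be evicted."""
    weights = [1.0 / (p + eps) for p in priorities]
    return random.choices(range(len(priorities)), weights=weights, k=1)[0]
```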

E TEMPERATURE TO SOFT-POLICY METRIC

We investigated the sensitivity of the on-policyness metric by changing the temperature parameter β: the Q-values are multiplied by β before the softmax is applied when calculating the on-policyness. The results are shown in Figure 13. The setting β = 0 is equivalent to calculating the priority value from surprise alone. We saw only a slight change in performance for most environments when the temperature parameter was varied.
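A small sketch illustrating the effect of β (illustrative function name): β = 0 makes every action equally weighted, so the priority reduces to the surprise term alone, while larger β sharpens the weighting toward the greedy action.

```python
import math

def soft_weights(q_values, beta):
    """Temperature-scaled softmax over Q-values: softmax(beta * Q)."""
    m = max(q_values)
    exps = [math.exp(beta * (q - m)) for q in q_values]  # stabilized
    s = sum(exps)
    return [e / s for e in exps]

print(soft_weights([1.0, 3.0, 0.5], beta=0.0))  # uniform weights
print(soft_weights([1.0, 3.0, 0.5], beta=2.0))  # sharply favors action 1
```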

G SMALLER BUFFER SETTINGS

We also experimented on Atari settings where the size of the replay buffer is constrained to 1k, one-tenth of the original constrained setting. All other hyperparameters except the buffer size were kept the same as in Appendix A. The results are shown in Figure 16.

H BEST PERFORMANCE RESULTS

We show the mean best performance of the results shown in Appendix F and Appendix G in Table 2.



Figure 3: Comparison of the cumulative rewards in constrained-memory settings. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds.

Figure 4: Comparison of the cumulative rewards when each factor was removed from the priority calculation function in constrained-memory settings. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds. The dotted line shows the best mean score of the baseline method in the constrained setting.

Figure 5: Comparisons of latent state distributions of images stored in the replay buffer at certain training steps, visualized by t-SNE (Van der Maaten & Hinton, 2008). The color of the points is determined by the cumulative discounted reward the agent received within the transition.

Figure 6: Comparison of the cumulative rewards with and without the proposed method when the buffer size is set to 1M. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds. The curves are smoothed with a moving average of 10 to improve readability.

Figure 7: Appearance of MountainCar-v0

Figure 8: State-action distribution of tuples in the FIFO replay buffer

Figure 10: State action distribution of tuples stored in the replay buffer at the late stage of training when using the specific pruning strategy in MountainCar-v0.

Figure 11: Comparison of the cumulative rewards with and without the proposed method when the buffer size is constrained to 2k. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds.

Figure 12: Comparison between the performance of deterministic and stochastic selection of the tuples. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds.

Figure 13: Cumulative rewards with various temperature values. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds.

Figure 14: Comparison of the cumulative rewards in constrained-memory settings. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds.

Figure 15: Comparison of the cumulative rewards with and without the proposed method when the buffer size is constrained to 10k. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds.

Figure 16: Comparison of the cumulative rewards with and without the proposed method when the buffer size is constrained to 1k. The solid line and shaded regions represent the mean and standard deviation, respectively, across five random seeds.


Table 1: Parameters for Atari experiments

