MEMORY OF UNIMAGINABLE OUTCOMES IN EXPERIENCE REPLAY

Abstract

Model-based reinforcement learning (MBRL) applies a single-shot dynamics model to imagined actions to select those with the best expected outcome. The dynamics model is an unfaithful representation of the environment physics, and its capacity to predict the outcome of a future action varies as it is trained iteratively. An experience replay buffer collects the outcomes of all actions executed in the environment and is used to iteratively train the dynamics model. With growing experience, the model is expected to become more accurate at predicting the outcome and expected reward of imagined actions. However, training times and memory requirements increase drastically with the growing collection of experiences. Indeed, it would be preferable to retain only those experiences that could not be anticipated by the model while interacting with the environment. We argue that doing so results in a lean replay buffer with diverse experiences that correspond directly to the model's predictive weaknesses at a given point in time. We propose strategies for: i) determining reliable predictions of the dynamics model with respect to the imagined actions, ii) retaining only the unimaginable experiences in the replay buffer, and iii) training further only when sufficient novel experience has been acquired. We show that these contributions lead to lower training times, a drastic reduction of the replay buffer size, fewer updates to the dynamics model, and a reduction of catastrophic forgetting, all of which enable the effective implementation of continual-learning agents using MBRL.



1 Introduction

Model-based Reinforcement Learning (MBRL) is attractive because it tends to have a lower sample complexity compared to model-free algorithms such as Soft Actor-Critic (SAC) (Haarnoja et al. (2018)). MBRL agents function by building a model of the environment in order to predict trajectories of future states from imagined actions. An MBRL agent maintains an extensive history of its observations, the actions it took in response to those observations, the resulting rewards, and the new observations in an experience replay buffer. The information stored in the replay buffer is used to train a single-shot dynamics model that iteratively predicts the outcomes of imagined actions, forming a trajectory of future states. At each time step, the agent executes only the first action in the trajectory, and the model then re-imagines a new trajectory given the result of this action (Nagabandi et al. (2018)). Yet many real-world tasks consist of sequences of subtasks of arbitrary length that accrue repetitive experiences, for example driving along a long straight road and then taking a corner. Capturing the complete dynamics here requires longer sessions of continual learning (Xie & Finn (2021)).

Optimization of the experience replay methodology is an open problem. The choice of size and maintenance strategy for the replay buffer both have considerable impact on asymptotic performance and training stability (Zhang & Sutton (2017)). From a resource perspective, the size and maintenance strategy of the replay buffer pose major concerns for longer learning sessions.

Overfitting is also a concern when accumulating similar or repetitive states. The buffer can become inundated with redundant information while consequently under-representing other important states. Indefinite training on redundant data can result in an inability to generalize to, or remember, less common states. Conversely, too small a buffer is unlikely to retain sufficient relevant experience into the future. Ideally, a buffer would be exactly the size needed to capture sufficient detail for all relevant states (Zhang & Sutton (2017)). Note that knowing all relevant states a priori is infeasible without extensive exploration.
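The planning loop described above (imagine trajectories under the learned dynamics model, execute only the first action, then re-plan) can be sketched as a random-shooting model-predictive-control procedure. The toy dynamics and reward functions below are illustrative assumptions for a one-dimensional state, not the paper's learned model or environment:

```python
import numpy as np

def plan_action(dynamics_model, reward_fn, state, horizon=10, n_candidates=100, rng=None):
    """Random-shooting MPC: imagine trajectories under the learned dynamics
    model and return the first action of the best-scoring candidate."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Sample candidate action sequences (here: 1-D actions in [-1, 1]).
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            s = dynamics_model(s, a)   # single-shot model applied iteratively
            returns[i] += reward_fn(s, a)
    # Only the first action is executed; the trajectory is re-imagined
    # from the resulting state at the next time step.
    return candidates[np.argmax(returns)][0]

# Toy stand-ins (assumptions for illustration only):
toy_dynamics = lambda s, a: s + 0.1 * a   # learned model f(s, a) -> s'
toy_reward = lambda s, a: -abs(s)         # drive the state toward 0

action = plan_action(toy_dynamics, toy_reward, state=1.0)
```

In a full agent, `dynamics_model` would be the trained network and the sampling of candidates could be replaced by a more sophisticated optimizer, but the execute-first-action-then-re-plan structure is the same.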


Under review as a conference paper at ICLR 2023

We argue that these problems can be subverted by employing a strategy that avoids retaining experiences that the model has already sufficiently mastered. Humans seem to perform known actions almost unconsciously (e.g., walking), but they reflect on actions that lead to unanticipated events (e.g., walking over seemingly solid ice and falling through). This is our inspiration to curate the replay buffer based on whether the experiences are predictable for the model.


Through this work, we propose techniques to capture both common and sporadic experiences with sufficient detail for prediction in longer learning sessions. The approach comprises strategies for: i) determining reliable predictions of the dynamics model with respect to the imagined actions, ii) retaining only the unimaginable experiences in the replay buffer, iii) training further only when sufficient novel experience has been acquired, and iv) reducing the effects of catastrophic forgetting. These strategies enable a model to self-manage both its buffer size and its decisions to train, drastically reducing the wall-time needed to converge. These are critical improvements toward the implementation of effective and stable continual-learning agents.
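As a concrete illustration of strategies ii) and iii), a replay buffer might gate insertion on the one-step prediction error of the dynamics model and signal retraining only once enough novel transitions have accumulated. The error threshold and training trigger below are illustrative assumptions, not the paper's tuned settings:

```python
import numpy as np
from collections import deque

class UnimaginableBuffer:
    """Sketch: retain only transitions the dynamics model fails to predict
    within a tolerance, and trigger training once enough novelty accrues."""

    def __init__(self, error_threshold=0.1, train_trigger=32, maxlen=10_000):
        self.buffer = deque(maxlen=maxlen)
        self.error_threshold = error_threshold   # assumed tolerance
        self.train_trigger = train_trigger       # assumed novelty budget
        self.novel_since_training = 0

    def observe(self, model, state, action, next_state):
        """Store the transition only if the model's prediction misses it."""
        predicted = model(state, action)
        error = float(np.linalg.norm(np.atleast_1d(next_state - predicted)))
        if error > self.error_threshold:         # an "unimaginable" outcome
            self.buffer.append((state, action, next_state))
            self.novel_since_training += 1
        return error

    def should_train(self):
        """True once sufficient novel experience has been acquired."""
        if self.novel_since_training >= self.train_trigger:
            self.novel_since_training = 0
            return True
        return False
```

Well-predicted transitions are discarded on arrival, so the buffer's contents track the model's current predictive weaknesses, and training updates are spent only when the buffer has genuinely changed.
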


Our contributions can be summarized as follows: i) steps toward the applicability of MBRL in continual-learning settings, ii) a method to keep the replay buffer size to a minimum without sacrificing performance, and iii) a method that reduces training time. Together, these contributions keep only useful information in a balanced replay buffer, even during longer learning sessions.

