REPLAY BUFFER WITH LOCAL FORGETTING FOR ADAPTIVE DEEP MODEL-BASED REINFORCEMENT LEARNING

Abstract

One of the key behavioral characteristics used in neuroscience to determine whether the subject of study, be it a rodent or a human, exhibits model-based learning is effective adaptation to local changes in the environment. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to such changes. An explanation for this mismatch is that MBRL methods are typically designed with single-task sample-efficiency in mind, while the requirements for effective adaptation are substantially higher, both for the learned world model and for the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout the relevant parts of the state-space. This is difficult for deep-learning-based world models due to catastrophic forgetting. And while a replay buffer can mitigate the effects of catastrophic forgetting, the traditional first-in-first-out (FIFO) replay buffer precludes effective adaptation because it retains stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer overcomes this limitation. By removing samples from the buffer only from the local neighbourhood of newly observed samples, deep world models can be built that maintain their accuracy across the state-space while also adapting effectively to changes in the reward function. We demonstrate this by applying our replay-buffer variation to a deep version of the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, showing that deep model-based methods can also adapt effectively to local changes in the environment.

1. INTRODUCTION

Recent work has shown that modern deep MBRL methods adapt poorly to local changes in the environment (Van Seijen et al., 2020; Wan et al., 2022), despite this being a key characteristic of model-based learning in humans and animals (Daw et al., 2011). The analysis by Wan et al. (2022) revealed that there are broadly two causes for this lack of adaptivity: an insufficient world model or insufficient planning. The former is especially challenging to overcome when deep-learning-based world models are considered. The core of this challenge is that adaptivity requires a world model that is accurate across the relevant state-space, as a small change in the reward or transition function can change the trajectory of the optimal policy entirely. By contrast, to achieve high single-task sample-efficiency, a common metric in MBRL research, it is sufficient that the world model is accurate along the current behavior policy. For deep world models, accuracy across the state-space is hard to achieve and maintain, even with sufficient exploration. The reason is that collected samples are strongly correlated and, at the final stages of learning, mostly come from states along the trajectory of the optimal policy. Due to catastrophic forgetting, the quality of the predictions further away from this trajectory quickly degrades. A common strategy to counter this is to use a replay buffer. By randomly sampling from a large replay buffer and using these samples to update the world model, the effects of catastrophic forgetting are greatly reduced. However, the traditional first-in-first-out (FIFO) replay buffer has the disadvantage that it hinders effective adaptation, as out-of-date samples interfere with the new data. To address the challenge of catastrophic forgetting while also avoiding interference from out-of-date samples, we propose a variation of the traditional FIFO replay buffer.
Instead of removing the oldest sample from the replay buffer once the buffer is full, the oldest sample in the local neighbourhood of the new sample is removed. This conceptually simple idea naturally leads to a replay buffer whose samples are spread out approximately equally across the state-space, while local changes are accounted for quickly. Consequently, updating the deep world model with samples drawn randomly from this replay buffer results in a world model that is approximately accurate across the state-space at each moment in time. We call this replay-buffer variation a LOFO (Local Forgetting) replay buffer. One practical challenge to our proposed variation is that a locality function needs to be learned that determines whether or not a sample from the replay buffer falls within the local neighborhood of a newly observed sample. We train this locality function using contrastive learning (Hadsell et al., 2006; Dosovitskiy et al., 2014; Wu et al., 2018) during the initial stages of learning, after which it is fixed and used as the basis for the LOFO replay buffer. We demonstrate the effectiveness of the LOFO replay buffer by combining it with a deep version of the classical Dyna method and measuring its adaptivity on a variation of the MountainCar task as well as a mini-grid domain, using the same Local Change Adaptation (LoCA) setup as Wan et al. (2022). We then test the limits of our approach by applying the same idea to both PlaNet (Hafner et al., 2019b) and DreamerV2 (Hafner et al., 2020), which use world models based on recurrent networks and are intended for continuous-action domains. Experiments with these modified methods on variations of the MuJoCo Reacher domain demonstrate that a LOFO replay buffer can substantially improve the adaptivity of more advanced deep MBRL methods as well.
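The eviction rule described above can be sketched in a few lines. The following is a minimal illustrative implementation, not the paper's actual code: the learned locality function is stood in for by an `embed` callable and a fixed `radius` threshold, both hypothetical parameters chosen here for illustration.

```python
import numpy as np


class LoFoReplayBuffer:
    """Sketch of a Local Forgetting (LOFO) replay buffer.

    When the buffer is full, instead of evicting the globally oldest
    sample (FIFO), we evict the oldest sample that lies within the
    local neighbourhood of the incoming sample. `embed` and `radius`
    are illustrative stand-ins for the learned locality function.
    """

    def __init__(self, capacity, embed=lambda s: s, radius=0.5):
        self.capacity = capacity
        self.embed = embed        # maps a state to a feature vector
        self.radius = radius      # neighbourhood size in feature space
        self.samples = []         # transitions, ordered oldest -> newest

    def add(self, transition):
        state = transition[0]
        if len(self.samples) >= self.capacity:
            z_new = self.embed(state)
            # indices of stored samples in the new sample's neighbourhood
            neighbours = [
                i for i, t in enumerate(self.samples)
                if np.linalg.norm(self.embed(t[0]) - z_new) < self.radius
            ]
            # evict the oldest neighbour; fall back to FIFO if none found
            self.samples.pop(neighbours[0] if neighbours else 0)
        self.samples.append(transition)

    def sample(self, batch_size, rng=np.random):
        idx = rng.choice(len(self.samples), size=batch_size)
        return [self.samples[i] for i in idx]
```

Because eviction is local, stale transitions from a changed region are replaced quickly, while samples from rarely revisited regions persist, which is exactly what a FIFO buffer fails to provide.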
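The locality function itself is trained with contrastive learning, as noted above. A minimal sketch of the classic pairwise contrastive loss in the style of Hadsell et al. (2006), where the pair labels (e.g. temporally adjacent states as positives) and the `margin` value are illustrative assumptions rather than details from the paper:

```python
import numpy as np


def contrastive_loss(z1, z2, same_neighbourhood, margin=1.0):
    """Pairwise contrastive loss on two embedded states (sketch).

    `same_neighbourhood` is 1 for positive pairs (e.g. temporally
    adjacent states) and 0 for negative pairs; a hypothetical training
    loop would minimise this loss over the parameters of the encoder
    that produced z1 and z2.
    """
    d = np.linalg.norm(z1 - z2)
    if same_neighbourhood:
        return d ** 2                      # pull positive pairs together
    return max(0.0, margin - d) ** 2       # push negatives beyond the margin
```

Once trained, the encoder is frozen and distances in its embedding space define the neighbourhoods used for local eviction.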

2. BACKGROUND: LOCAL CHANGE ADAPTATION (LOCA) SETUP

Inspired by work from neuroscience on detecting model-based behavior in humans and animals (Daw et al., 2011), Van Seijen et al. (2020) proposed the Local Change Adaptation (LoCA) regret as an experimental setup and metric to evaluate model-based behavior of RL methods. The LoCA regret measures how quickly a method adapts its policy after observing a local environment change, and estimates how close a method is to ideal model-based behavior. Wan et al. (2022) improved the LoCA setup by making it simpler, less sensitive to hyperparameters, and easily applicable to stochastic environments. It evaluates whether an RL method can reach close-to-optimal performance after observing a local environment change, and classifies methods into those that adapt effectively to local changes and those that do not. In this work we use this improved LoCA setup (Figure 1) for all our evaluations. It consists of two main components: the task configuration and the experiment configuration. The task configuration considers an environment with two tasks, namely A and B, which differ only in their reward functions. A method's adaptivity is determined by analyzing how effectively it can adapt from task A to task B. A learning environment suited for the LoCA setup contains two terminal states, namely T1 and T2. The reward function for both tasks is always zero except when transitioning to either of the terminal states. For task A, the agent receives a reward of 4 upon transitioning to T1 and 2 upon transitioning to T2. For task B, however, a transition to T1 yields a reward of 1 (instead of 4 as in task A), and transitioning to T2 results in the same reward as in task A, which is 2. In addition, both of these tasks share the same discount factor 0 < γ < 1. Finally, the transition dynamics for
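The two-task reward structure described above can be summarized in a short sketch. The function name and signature here are illustrative, not from the paper; only the reward values follow the text.

```python
# Sketch of the LoCA reward structure: tasks A and B differ only in
# the reward received upon transitioning into terminal state T1.
def loca_reward(terminal, task):
    """Reward for transitioning into `terminal` ('T1', 'T2', or None
    for a non-terminal transition) under task 'A' or 'B'."""
    rewards = {
        "A": {"T1": 4.0, "T2": 2.0},
        "B": {"T1": 1.0, "T2": 2.0},  # only T1's reward changes
    }
    return rewards[task].get(terminal, 0.0)
```

Since only the entry for T1 changes between the two tasks, an ideal model-based agent only needs to update its model locally around T1, yet must re-plan globally to switch its policy toward T2.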



Figure 1: LoCA Setup: During training, in Phase 1, the agent is randomly initialized throughout the state-space and has to solve Task A, which has a higher reward in T1 compared to T2. In Phase 2, the agent is initialized within the one-way T1-zone and the reward function changes to Task B, with T1's reward changing to a lower value than T2's, hence changing the optimal policy. During evaluation, the agent is initialized randomly throughout the entire state-space in both phases.

