MINIMAL VALUE-EQUIVALENT PARTIAL MODELS FOR SCALABLE AND ROBUST PLANNING IN LIFELONG REINFORCEMENT LEARNING

Abstract

Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that capture every aspect of the agent's environment, regardless of whether these aspects are important for making optimal decisions. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios, and we propose new kinds of models that capture only the relevant aspects of the environment, which we call minimal value-equivalent partial models. After providing formal definitions of these models, we present theoretical results demonstrating the scalability advantages of planning with them and then perform experiments to empirically illustrate our theoretical results. Finally, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in this way allow for planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits for scalable and robust planning in lifelong reinforcement learning scenarios.

1. INTRODUCTION

It has long been argued that in order for reinforcement learning (RL) agents to perform well in lifelong RL (LRL) scenarios, they should be able to learn a model of their environment, which allows for advanced computational abilities such as counterfactual reasoning and fast re-planning (Sutton & Barto, 2018; Schaul et al., 2018; Sutton et al., 2022). Even though this is a widely accepted view in the RL community, the question of what kinds of models would be better suited for performing LRL still remains unanswered. As LRL scenarios involve large environments with many irrelevant aspects and periodic or non-periodic distribution shifts, directly applying the ideas developed in the classical model-based RL literature (see e.g., Ch. 8 of Sutton & Barto, 2018) to these problems is likely to lead to catastrophic results in building scalable and robust lifelong learning agents. Thus, there is a need to rethink some of the ideas developed in the classical model-based RL literature while developing new concepts and algorithms for performing model-based RL in LRL scenarios. In this paper, we argue that one important idea to reconsider is whether the agent's model should capture every aspect of its environment. In classical model-based RL, the learned model is a model of every aspect of the environment. However, due to the large state spaces of LRL environments, these types of models are likely to lead to serious problems in performing scalable model-based RL, i.e., in quickly learning a model and in quickly planning with the learned model to come up with an optimal policy. Also, due to the inherent non-stationarity of LRL environments, such detailed models are likely to overfit to the irrelevant aspects of the environment and cause serious problems in performing robust model-based RL, i.e., learning and planning with models that are robust to distribution shifts and compounding model errors.
To this end, we argue that models that capture only the relevant aspects of the agent's environment, which we call minimal value-equivalent partial models, would be better suited for performing

