MINIMAL VALUE-EQUIVALENT PARTIAL MODELS FOR SCALABLE AND ROBUST PLANNING IN LIFELONG REINFORCEMENT LEARNING

Abstract

Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment, regardless of whether these aspects are important in coming up with optimal decisions or not. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios, and we propose new kinds of models that only model the relevant aspects of the environment, which we call minimal value-equivalent partial models. After providing the formal definitions of these models, we provide theoretical results demonstrating the scalability advantages of performing planning with such models, and then perform experiments to empirically illustrate our theoretical results. Finally, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits for performing scalable and robust planning in lifelong reinforcement learning scenarios.

1. INTRODUCTION

It has long been argued that in order for reinforcement learning (RL) agents to perform well in lifelong RL (LRL) scenarios, they should be able to learn a model of their environment, which allows for advanced computational abilities such as counterfactual reasoning and fast re-planning (Sutton & Barto, 2018; Schaul et al., 2018; Sutton et al., 2022). Even though this is a widely accepted view in the RL community, the question of what kinds of models would be better suited for performing LRL still remains unanswered. As LRL scenarios involve large environments with many irrelevant aspects and periodic or non-periodic distribution shifts, directly applying the ideas developed in the classical model-based RL literature (see e.g., Ch. 8 of Sutton & Barto, 2018) to these problems is likely to lead to catastrophic results in building scalable and robust lifelong learning agents. Thus, there is a need to rethink some of the ideas developed in the classical model-based RL literature while developing new concepts and algorithms for performing model-based RL in LRL scenarios. In this paper, we argue that one important idea to reconsider is whether the agent's model should model every aspect of its environment. In classical model-based RL, the learned model is a model of every aspect of the environment. However, due to the large state spaces of LRL environments, these types of models are likely to lead to serious problems in performing scalable model-based RL, i.e., in quickly learning a model and in quickly performing planning with the learned model to come up with an optimal policy. Also, due to the inherent non-stationarity of LRL environments, these types of detailed models are likely to overfit to the irrelevant aspects of the environment and cause serious problems in performing robust model-based RL, i.e., learning and planning with models that are robust to distribution shifts and compounding model errors.
To this end, we argue that models that only model the relevant aspects of the agent's environment, which we call minimal value-equivalent partial models, would be better suited for performing model-based RL in LRL scenarios. We first start by developing the theoretical underpinnings of how such models can be defined and studied in model-based RL. Then, we provide theoretical results demonstrating the scalability advantages, i.e., the value and planning loss advantages and the computational and sample complexity advantages, of performing planning with minimal value-equivalent partial models, and then perform several experiments to empirically illustrate these theoretical results. Finally, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits for performing scalable and robust model-based RL in LRL scenarios. We hope that our study will bring the community a step closer to building model-based RL agents that are able to perform well in LRL scenarios.

2. BACKGROUND

Reinforcement Learning. In RL (Sutton & Barto, 2018), an agent interacts with its environment through a sequence of actions to maximize its long-term cumulative reward. Here, the environment is usually described as a Markov decision process (MDP) M ≡ (S, A, P, R, γ), where S and A are the (finite) sets of states and actions, P : S × A × S → [0, 1] is the transition distribution, R : S × A → [0, R_max] is the reward function, and γ ∈ [0, 1) is the discount factor. On the agent's side, through the use of a perfect state encoder φ* : S → F, every state s ∈ S can be represented, without any loss of information, as an n-dimensional feature vector f = [f_1, f_2, …, f_n]^⊤ ∈ F, which consists of n different features F = {f_i}_{i=1}^{n} where f_i ∈ F_i ∀i ∈ {1, …, n} (also see Boutilier et al. (2000)). Note that, as there is no loss of information, F contains all the possible features that are relevant in describing the states of the environment. Thus, from the agent's side, the MDP M can losslessly be represented as another MDP m* = (F, A, p*, r*, γ), where F and A are the (finite) sets of feature vectors and actions, p* : F × A × F → [0, 1] and r* : F × A → [0, R_max] are the transition distribution and reward function, and γ ∈ [0, 1) is the discount factor. For convenience, we take the agent's view and refer to the environment as m* throughout this study. The goal of the agent is to learn a value estimator Q : F × A → ℝ that induces a policy π ∈ Π ≡ {π | π : F × A → [0, 1]}, maximizing E_{π,p*}[ Σ_{t=0}^{∞} γ^t r*(F_t, A_t) | F_0 ] for all F_0 ∈ F.

Model-Based RL. One of the prevalent ways of achieving this goal is through the use of model-based RL methods, in which there are two main phases: the learning phase and the planning phase.
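For a tabular MDP of the form above, the expected discounted return E_{π,p}[ Σ_{t=0}^{∞} γ^t r(F_t, A_t) | F_0 = f ] can be computed exactly by iterating the Bellman backup. The sketch below is purely illustrative (the arrays `p`, `r`, and `pi` are hypothetical placeholders, not from the paper):

```python
import numpy as np

def policy_evaluation(p, r, pi, gamma=0.9, tol=1e-8):
    """Iteratively compute V^pi(f) = E_{pi,p}[ sum_t gamma^t r(F_t, A_t) | F_0 = f ].

    p:  (|F|, |A|, |F|) array of transition probabilities p(f' | f, a)
    r:  (|F|, |A|)      array of rewards r(f, a)
    pi: (|F|, |A|)      array of policy probabilities pi(a | f)
    """
    v = np.zeros(p.shape[0])
    while True:
        # Bellman backup: Q(f,a) = r(f,a) + gamma * sum_f' p(f'|f,a) V(f')
        q = r + gamma * (p @ v)
        v_new = (pi * q).sum(axis=1)  # V(f) = sum_a pi(a|f) Q(f,a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# A toy 2-feature-vector, 2-action MDP: action 0 leads to state 0 (reward 0),
# action 1 leads to state 1 (reward 1), regardless of the current state.
p = np.zeros((2, 2, 2)); p[:, 0, 0] = 1.0; p[:, 1, 1] = 1.0
r = np.zeros((2, 2)); r[:, 1] = 1.0
pi = np.full((2, 2), 0.5)  # uniform-random policy
v = policy_evaluation(p, r, pi, gamma=0.9)  # both entries converge to 5.0 here
```

Since the backup is a γ-contraction, the iteration converges to the unique fixed point of the Bellman equations for any initialization.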
In the learning phase, the gathered experience is mainly used in learning an encoder φ : S → F and a model m ≡ (p, r) ∈ M ≡ {(p, r) | p : F × A × F → [0, 1], r : F × A → [0, R_max]}, and optionally, the experience may also be used in improving the value estimator. In the planning phase, the learned model m is then used either for solving for the fixed point of a system of Bellman equations (Bellman, 1957), or for simulating experience, either to be used alongside real experience in improving the value estimator, or just to be used in selecting actions at decision time (Alver & Precup, 2022; Sutton & Barto, 2018).

Value-Equivalence. One of the recent trends in model-based RL is to learn models that are specifically useful for value-based planning (see e.g., Silver et al., 2017; Schrittwieser et al., 2020), which has been recently formalized in several different ways through the studies of Grimm et al. (2020; 2021). Inspired by these studies, we define a related form of value-equivalence as follows. Let V^π_m ∈ ℝ^{|F|} be the value vector of a policy π ∈ Π evaluated in model m, whose elements are defined ∀f ∈ F as V^π_m(f) ≡ E_{π,p}[ Σ_{t=0}^{∞} γ^t r(F_t, A_t) | F_0 = f ], and let V*_m ∈ ℝ^{|F|} be the optimal value vector in model m. We say that a model m ∈ M is a value-equivalent (VE) model of the true environment m* ∈ M if the following equality holds: V^{π*_m}_{m*} = V*_{m*} ∀π*_m ∈ Π, where π*_m is an optimal policy obtained as a result of planning with model m.
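For small tabular problems, the value-equivalence condition V^{π*_m}_{m*} = V*_{m*} can be checked numerically: plan in the model m, then verify that the resulting policy attains the optimal values in m*. The sketch below uses value iteration as the planner (one possible choice, not prescribed by the text) and treats all models as hypothetical (p, r) array pairs:

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, tol=1e-10):
    """Return the optimal value vector V* and a greedy optimal policy for model (p, r)."""
    v = np.zeros(p.shape[0])
    while True:
        q = r + gamma * (p @ v)            # Q(f,a) backup
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # deterministic greedy policy
        v = v_new

def evaluate_deterministic_policy(p, r, pi, gamma=0.9, tol=1e-10):
    """Value vector of a deterministic policy pi (one action index per f) in model (p, r)."""
    v = np.zeros(p.shape[0])
    while True:
        q = r + gamma * (p @ v)
        v_new = q[np.arange(len(pi)), pi]
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def is_value_equivalent(model, env, gamma=0.9, atol=1e-6):
    """Check V^{pi*_m}_{m*} == V*_{m*}, where pi*_m comes from planning in `model`."""
    _, pi_m = value_iteration(*model, gamma)       # plan in the (possibly partial) model m
    v_star_env, _ = value_iteration(*env, gamma)   # ground-truth optimal values in m*
    v_pi_env = evaluate_deterministic_policy(*env, pi_m, gamma)
    return np.allclose(v_pi_env, v_star_env, atol=atol)
```

Note that a VE model need not match m* everywhere: for instance, a model whose rewards differ from the environment's by a constant offset still yields the same greedy optimal policy, and hence passes this check.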

3. MINIMAL VALUE-EQUIVALENT PARTIAL MODELS

In classical model-based RL (Ch. 8 of Sutton & Barto, 2018), an agent learns a very detailed model of its environment that models every aspect of it, regardless of whether these aspects are relevant

