MINIMAL VALUE-EQUIVALENT PARTIAL MODELS FOR SCALABLE AND ROBUST PLANNING IN LIFELONG REINFORCEMENT LEARNING

Abstract

Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that capture every aspect of the agent's environment, regardless of whether those aspects matter for making optimal decisions. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios, and we propose new kinds of models that only model the relevant aspects of the environment, which we call minimal value-equivalent partial models. After providing the formal definitions of these models, we provide theoretical results demonstrating the scalability advantages of performing planning with such models and then perform experiments to empirically illustrate our theoretical results. Finally, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way can allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits to performing scalable and robust planning in lifelong reinforcement learning scenarios.

1. INTRODUCTION

It has long been argued that in order for reinforcement learning (RL) agents to perform well in lifelong RL (LRL) scenarios, they should be able to learn a model of their environment, which allows for advanced computational abilities such as counterfactual reasoning and fast re-planning (Sutton & Barto, 2018; Schaul et al., 2018; Sutton et al., 2022). Even though this is a widely accepted view in the RL community, the question of what kinds of models would be better suited for performing LRL still remains unanswered. As LRL scenarios involve large environments with many irrelevant aspects and periodic or non-periodic distribution shifts, directly applying the ideas developed in the classical model-based RL literature (see e.g., Ch. 8 of Sutton & Barto, 2018) to these problems is likely to lead to catastrophic results in building scalable and robust lifelong learning agents. Thus, there is a need to rethink some of the ideas developed in the classical model-based RL literature while developing new concepts and algorithms for performing model-based RL in LRL scenarios. In this paper, we argue that one important idea to reconsider is whether the agent's model should model every aspect of its environment. In classical model-based RL, the learned model is a model over every aspect of the environment. However, due to the large state spaces of LRL environments, these types of models are likely to lead to serious problems in performing scalable model-based RL, i.e., in quickly learning a model and in quickly performing planning with the learned model to come up with an optimal policy. Also, due to the inherent non-stationarity of LRL environments, these types of detailed models are likely to overfit to the irrelevant aspects of the environment and cause serious problems in performing robust model-based RL, i.e., learning and planning with models that are robust to distribution shifts and compounding model errors.
To this end, we argue that models that only model the relevant aspects of the agent's environment, which we call minimal value-equivalent partial models, would be better suited for performing model-based RL in LRL scenarios. We first develop the theoretical underpinnings of how such models could be defined and studied in model-based RL. Then, we provide theoretical results demonstrating the scalability advantages, i.e., the value and planning loss and computational and sample complexity advantages, of performing planning with minimal value-equivalent partial models and then perform several experiments to empirically illustrate these theoretical results. Finally, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way can allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits to performing scalable and robust model-based RL in LRL scenarios. We hope that our study will bring the community a step closer to building model-based RL agents that are able to perform well in LRL scenarios.

2. BACKGROUND

Reinforcement Learning. In RL (Sutton & Barto, 2018), an agent interacts with its environment through a sequence of actions to maximize its long-term cumulative reward. Here, the environment is usually described as a Markov decision process (MDP) $M \equiv (S, A, P, R, \gamma)$, where $S$ and $A$ are the (finite) sets of states and actions, $P : S \times A \times S \to [0, 1]$ is the transition distribution, $R : S \times A \to [0, R_{\max}]$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. On the agent's side, through the use of a perfect state encoder $\phi^* : S \to F$, every state $s \in S$ can be represented, without any loss of information, as an $n$-dimensional feature vector $f = [f_1, f_2, \ldots, f_n]^\top \in F$, which consists of $n$ different features $\mathcal{F} = \{f_i\}_{i=1}^{n}$ where $f_i \in F_i \ \forall i \in \{1, \ldots, n\}$ (also see Boutilier et al. (2000)). Note that as there is no loss of information, $\mathcal{F}$ contains all the possible features that are relevant in describing the states of the environment. Thus, from the agent's side, the MDP $M$ can losslessly be represented as another MDP $m^* = (F, A, p^*, r^*, \gamma)$, where $F$ and $A$ are the (finite) sets of feature vectors and actions, $p^* : F \times A \times F \to [0, 1]$ and $r^* : F \times A \to [0, R_{\max}]$ are the transition distribution and reward function, and $\gamma \in [0, 1)$ is the discount factor. For convenience, we take the agent's view and refer to the environment as $m^*$ throughout this study. The goal of the agent is to learn a value estimator $Q : F \times A \to \mathbb{R}$ that induces a policy $\pi \in \Pi \equiv \{\pi \mid \pi : F \times A \to [0, 1]\}$, maximizing $E_{\pi, p^*}\left[\sum_{t=0}^{\infty} \gamma^t r^*(F_t, A_t) \mid F_0\right]$ for all $F_0 \in F$.

Model-Based RL. One of the prevalent ways of achieving this goal is through the use of model-based RL methods in which there are two main phases: the learning and planning phases.
In the learning phase, the gathered experience is mainly used in learning an encoder $\phi : S \to F$ and a model $m \equiv (p, r) \in M \equiv \{(p, r) \mid p : F \times A \times F \to [0, 1],\ r : F \times A \to [0, R_{\max}]\}$, and optionally, the experience may also be used in improving the value estimator. In the planning phase, the learned model $m$ is then used either for solving for the fixed point of a system of Bellman equations (Bellman, 1957), or for simulating experience, either to be used alongside real experience in improving the value estimator, or just to be used in selecting actions at decision time (Alver & Precup, 2022; Sutton & Barto, 2018).

Value-Equivalence. One of the recent trends in model-based RL is to learn models that are specifically useful for value-based planning (see e.g., Silver et al., 2017; Schrittwieser et al., 2020), which has been recently formalized in several different ways through the studies of Grimm et al. (2020; 2021). Inspired by these studies, we define a related form of value-equivalence as follows. Let $V^{\pi}_{m} \in \mathbb{R}^{|F|}$ be the value vector of a policy $\pi \in \Pi$ evaluated in model $m$, whose elements are defined $\forall f \in F$ as $V^{\pi}_{m}(f) \equiv E_{\pi, p}\left[\sum_{t=0}^{\infty} \gamma^t r(F_t, A_t) \mid F_0 = f\right]$, and let $V^{*}_{m} \in \mathbb{R}^{|F|}$ be the optimal value vector in model $m$. We say that a model $m \in M$ is a value-equivalent (VE) model of the true environment $m^* \in M$ if the following equality holds: $V^{\pi^*_m}_{m^*} = V^{*}_{m^*} \ \ \forall \pi^*_m \in \Pi$, where $\pi^*_m$ is an optimal policy obtained as a result of planning with model $m$.
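The value-equivalence condition above can be checked numerically on small tabular problems. The following is a minimal sketch, assuming a model is stored as nested lists `(P, R)` over the same feature-vector space as the environment; the function names and the toy numbers are hypothetical and not taken from the paper. It also only tests the single greedy policy returned by planning, whereas the definition quantifies over every optimal policy of the model.

```python
def q_backup(P, R, V, s, a, gamma):
    """One-step Bellman backup for the pair (s, a)."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def value_iteration(model, gamma, tol=1e-10):
    """Return (V*, greedy policy) of a tabular model (P, R)."""
    P, R = model
    states, actions = range(len(R)), range(len(R[0]))
    V = [0.0 for _ in states]
    while True:
        V_new = [max(q_backup(P, R, V, s, a, gamma) for a in actions) for s in states]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break
    policy = [max(actions, key=lambda a: q_backup(P, R, V, s, a, gamma)) for s in states]
    return V, policy


def evaluate_policy(model, policy, gamma, tol=1e-10):
    """Return V^pi for a deterministic policy in a tabular model (P, R)."""
    P, R = model
    states = range(len(R))
    V = [0.0 for _ in states]
    while True:
        V_new = [q_backup(P, R, V, s, policy[s], gamma) for s in states]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            return V


def is_value_equivalent(model, true_env, gamma, atol=1e-6):
    """Check that the policy planned in `model` attains V* of `true_env`."""
    _, pi_m = value_iteration(model, gamma)
    v_star, _ = value_iteration(true_env, gamma)
    v_pi = evaluate_policy(true_env, pi_m, gamma)
    return all(abs(a - b) <= atol for a, b in zip(v_star, v_pi))


# Hypothetical two-state, two-action example; state 1 is absorbing.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
true_env = (P, [[0.0, 1.0], [0.0, 0.0]])
ve_model = (P, [[0.0, 5.0], [0.0, 0.0]])      # wrong rewards, same optimal policy
non_ve_model = (P, [[1.0, 0.0], [0.0, 0.0]])  # rewards that mislead the planner
```

Here `ve_model` gets the reward magnitudes wrong but still induces a policy that is optimal in `true_env`, which is all that the VE condition asks for.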

3. MINIMAL VALUE-EQUIVALENT PARTIAL MODELS

In classical model-based RL (Ch. 8 of Sutton & Barto, 2018), an agent learns a very detailed model of its environment that models every aspect of it, regardless of whether these aspects are relevant in the process of coming up with optimal decisions or not. However, in LRL scenarios, where the agent is "small" and the environment is "vast" (Schaul et al., 2018), this approach is likely to be problematic as modeling every aspect of the environment becomes quite impractical. Even if the agent overcomes its capacity limitations and manages to model every aspect, as we will demonstrate, these kinds of detailed models can lead to large planning losses and dramatically slow down both the model-learning and planning processes. And, as we will further demonstrate, detailed models can also be fragile to the distribution shifts in the environment and to the compounding model errors that occur while unrolling the learned model. In order to overcome these challenges, we start by proposing new kinds of models that only model certain aspects, either relevant or irrelevant, of the agent's environment. For this, we first clarify the notion of "aspect": in this study, by "aspect", we mean a feature of the environment $f_i \in F_i$ that is learnable by the agent (see Sec. 2). We are now ready to define partial models:

Definition 1 (Partial Models). Given a set of features […] $F_P \times A \times F_P \to [0, 1],\ r_P : F_P \times A \to [0, R_{\max}]\}$.

According to Defn. 1, any model that only models certain features of the environment is a partial model of the environment $m^* \in M$. However, in order for a partial model to be useful, it should be able to model the relevant features of the environment that allow for achieving the task of interest. In order to separate out the relevant features from the irrelevant ones, we define the relevant ones as:

Definition 2 (Relevant Features). Given a set of features $\mathcal{F}$, let $\mathcal{F}_R \subset \mathcal{F}$.
Let $F_R$ be a space of feature vectors in which the feature vectors consist of the features in $\mathcal{F}_R$. We say that the features $f_i \in \mathcal{F}_R$ are relevant features of the task of interest if they are necessary and sufficient for defining a space of models $M_R \equiv \{(p_R, r_R) \mid p_R : F_R \times A \times F_R \to [0, 1],\ r_R : F_R \times A \to [0, R_{\max}]\}$ […] $F_{VEP} \times A \times F_{VEP} \to [0, 1],\ r_{VEP} : F_{VEP} \times A \to [0, R_{\max}]\}$. We say that $m_{VEP}$ is a VE partial model of the true environment $m^* \in M$ if it is a VE model of $m^*$, i.e., $V^{\pi^*_{m_{VEP}}}_{m^*} = V^{*}_{m^*} \ \ \forall \pi^*_{m_{VEP}} \in \Pi_{VEP}$, where $\pi^*_{m_{VEP}}$ is an optimal policy obtained as a result of planning with model $m_{VEP}$ and $\Pi_{VEP} \equiv \{\pi \mid \pi : F_{VEP} \times A \to [0, 1]\}$.

Although it is important to learn partial models that at the very least model the relevant aspects of the environment, as we will theoretically and empirically demonstrate, partial models are mostly beneficial when they only model the relevant aspects of the environment, i.e., when $F_{VEP} = F_R$. We refer to these models as minimal VE partial models. Note that minimal VE partial models are a special class of VE partial models, and VE partial models are a special class of partial models.

Illustrative Example. As an illustration of the models defined above, let us start by considering the Squirrel's World (SW) environment depicted in Fig. 1, in which the job of the squirrel (the agent) is to navigate from cell E1 to cell E16 to pick up the nut without getting caught by the hawk that flies back and forth horizontally along row C.
At each time step, the squirrel receives as input a 5×16 image of the current state of the environment and then, through the use of a predefined state encoder, transforms this image into a feature vector that contains information regarding all aspects of the current state of the environment, i.e., the feature vector contains information on the current positions of the squirrel, the hawk and the cloud, the current direction of the hawk, the current wind direction in rows A and B, and the current weather condition. Based on this, the squirrel selects an action that either moves it to the left or right cell, or keeps its position fixed. If the squirrel gets caught by the hawk or if it runs out of time, it receives a reward of 0 and the episode terminates, and if the squirrel successfully navigates to the nut, it gets a reward of +10 and the episode terminates. In this environment, as the hawk moves at 5x the speed of the squirrel, a straightforward policy of always moving to the right will not get the squirrel to the nut. Thus, the squirrel has to come up with non-trivial policies that take into account both the cells with bushes (see e.g., cells E2, E3), which allow for sheltering, and the position and direction of the hawk. In this environment, examples of partial models are a model that only models the cloud position and the wind direction for rows A and B, or a model that only models the weather condition and the hawk's direction. However, for a partial model to be VE or minimal VE, it has to model the relevant features for the task of interest, which is reaching the nut. In the SW environment, there are three relevant features: (i) the squirrel's position, (ii) the hawk's position, and (iii) the hawk's direction, as the squirrel needs access to all three of these features to come up with optimal policies.
Thus, an example of a VE partial model can be a model that models both the three relevant features and the weather condition, and an example of a minimal VE partial model can be a model that only models the three relevant features.
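To make the distinction between a regular model and a minimal VE partial model concrete, the sketch below builds a much smaller toy problem than the SW environment. Everything in it is an assumption made for illustration: a four-cell corridor whose position is the only relevant feature, plus a binary weather feature that flips at random and never affects rewards or movement. The code plans in the position-only model, lifts the resulting policy to the full feature space, and measures the value loss in the full environment.

```python
from itertools import product

GAMMA, N_POS, GOAL = 0.9, 4, 3
MOVES = (-1, +1)  # action 0 = left, action 1 = right


def step_pos(pos, a):
    """Deterministic dynamics and reward of the relevant feature (position)."""
    if pos == GOAL:
        return pos, 0.0  # the goal cell is absorbing
    nxt = min(max(pos + MOVES[a], 0), N_POS - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def build_partial():
    """Model over the relevant feature only."""
    states = list(range(N_POS))
    P = {s: {a: {step_pos(s, a)[0]: 1.0} for a in range(2)} for s in states}
    R = {s: {a: step_pos(s, a)[1] for a in range(2)} for s in states}
    return states, P, R


def build_full(p_flip=0.3):
    """Model over (position, weather); weather is irrelevant to the task."""
    states = list(product(range(N_POS), (0, 1)))
    P, R = {}, {}
    for pos, w in states:
        P[(pos, w)], R[(pos, w)] = {}, {}
        for a in range(2):
            nxt, r = step_pos(pos, a)
            P[(pos, w)][a] = {(nxt, w): 1 - p_flip, (nxt, 1 - w): p_flip}
            R[(pos, w)][a] = r
    return states, P, R


def q_values(P, R, V, s):
    return [R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in range(2)]


def plan(states, P, R, sweeps=500):
    """Value iteration followed by greedy policy extraction."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(q_values(P, R, V, s)) for s in states}
    pi = {s: max(range(2), key=lambda a: q_values(P, R, V, s)[a]) for s in states}
    return V, pi


def evaluate(states, P, R, pi, sweeps=500):
    """Iterative policy evaluation of a deterministic policy."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: q_values(P, R, V, s)[pi[s]] for s in states}
    return V


full_states, P_full, R_full = build_full()
part_states, P_part, R_part = build_partial()
V_star, _ = plan(full_states, P_full, R_full)
_, pi_part = plan(part_states, P_part, R_part)
# Lift the partial policy: it simply ignores the weather feature.
pi_lifted = {(pos, w): pi_part[pos] for pos, w in full_states}
V_lifted = evaluate(full_states, P_full, R_full, pi_lifted)
value_loss = max(abs(V_star[s] - V_lifted[s]) for s in full_states)
```

In this toy setting the partial model has half as many feature vectors as the full one, yet the policy it induces loses no value, which is the behaviour a minimal VE partial model is meant to have.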

4. THEORETICAL RESULTS

In this section, we first analyze the value and planning losses (Sec. 4.1) of VE partial models and then derive formal results demonstrating the computational and sample complexity benefits (Sec. 4.2) of using such models. We then discuss scenarios where the VE partial model is a minimal one.

4.1. VALUE AND PLANNING LOSS ANALYSES

We start our formal analysis by studying the value loss incurred due to planning with a VE partial model $m_{VEP}$ in place of the true environment $m^*$. To simplify the analysis, we assume that the agent already has access to this model and does not need to learn it.

Theorem 1. Let $m_{VEP} \in M_{VEP}$ be a VE partial model of the true environment $m^* \in M$. Then, the value loss between an optimal policy in $m^*$, $\pi^*$, and an optimal policy in $m_{VEP}$, $\pi^*_{m_{VEP}}$, is given by: $\left\lVert V^{*}_{m^*} - V^{\pi^*_{m_{VEP}}}_{m^*} \right\rVert_\infty = 0$.

Due to space constraints, we defer all the proofs to App. A. Theorem 1 says that by planning with a (non-minimal or minimal) VE partial model, an agent would incur no value loss compared to planning with the true environment itself. Next, we study the planning loss (Jiang et al., 2015) incurred due to planning with an approximate VE partial model $\hat{m}_{VEP} \in M_{VEP}$ in place of the actual VE partial model $m_{VEP} \in M_{VEP}$. Similar to Jiang et al. (2015), we also consider the certainty-equivalence control setting in which the agent acts according to a policy that is optimal with respect to its current approximate model.

Theorem 2. Let $m_{VEP} \in M_{VEP}$ be a VE partial model of the true environment $m^* \in M$, and let $\hat{m}_{VEP} \in M_{VEP}$ be a model that comprises the reward function of $m_{VEP}$ and a transition distribution that is estimated from $n$ samples for each $(f, a)$ pair. Let $\Pi_{r_{VEP}} \equiv \{\pi \mid \exists\, p_{VEP} \text{ s.t. } \pi \text{ is optimal in } (p_{VEP}, r_{VEP})\}$. Then, certainty-equivalence planning with $\hat{m}_{VEP}$ has planning loss: $\left\lVert V^{*}_{m_{VEP}} - V^{\pi^*_{\hat{m}_{VEP}}}_{m_{VEP}} \right\rVert_\infty \le \frac{2 R_{\max}}{(1-\gamma)^2} \sqrt{\frac{1}{2n} \log \frac{2 |F_{VEP}| |A| |\Pi_{r_{VEP}}|}{\delta}}$, with probability at least $1 - \delta$.

Theorem 2 implies that given a fixed amount of data, the upper bound of the planning loss of a VE partial model depends on both the size of its feature vector space, $|F_{VEP}|$, and the size of the policy class being searched over by planning, $|\Pi_{r_{VEP}}|$. This in turn implies that, given a fixed amount of data, compared to a regular model, a VE partial model is likely to have a lower planning loss, and this loss is likely to be minimized when the VE partial model is a minimal one.
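The dependence of the Theorem 2 bound on the size of the feature vector space can be inspected numerically. The sketch below evaluates the right-hand side of the bound; the function name, the example sizes, and the worst-case substitution $|\Pi_{r_{VEP}}| \le |A|^{|F_{VEP}|}$ (the number of deterministic policies) are assumptions made for illustration and are not values from the paper.

```python
import math


def planning_loss_bound(n_feature_vecs, n_actions, n_samples, gamma, r_max,
                        delta, log_policy_class=None):
    """Right-hand side of the Theorem 2 bound.

    `log_policy_class` is log|Pi_r|. When it is not given, the worst case
    |Pi_r| <= |A|^|F| is assumed, and the logarithm is taken analytically
    to avoid overflow.
    """
    if log_policy_class is None:
        log_policy_class = n_feature_vecs * math.log(n_actions)
    log_term = math.log(2 * n_feature_vecs * n_actions / delta) + log_policy_class
    return 2 * r_max / (1 - gamma) ** 2 * math.sqrt(log_term / (2 * n_samples))


# Hypothetical sizes: a small (partial) and a large (regular) feature space.
small = planning_loss_bound(100, 3, n_samples=20, gamma=0.9, r_max=10.0, delta=0.05)
large = planning_loss_bound(10_000, 3, n_samples=20, gamma=0.9, r_max=10.0, delta=0.05)
more_data = planning_loss_bound(100, 3, n_samples=200, gamma=0.9, r_max=10.0, delta=0.05)
```

Under these assumptions the bound grows with the number of feature vectors and shrinks with the number of samples per $(f, a)$ pair. It is an upper bound only, so it indicates a trend rather than the actual planning loss.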

4.2. COMPUTATIONAL AND SAMPLE COMPLEXITY BENEFITS

We now study the computational and sample complexity benefits of performing model-based RL with VE partial models. Due to the well-established theoretical results around it, we choose to study these benefits in the context of value iteration (Bertsekas & Tsitsiklis, 1996). However, we note that the implications of our results would apply to a wide variety of planning algorithms. Starting with the computational complexity benefits, it is well-known that the computational complexity of performing a single step of value iteration with an arbitrary model $m \in M$ is $O(|F|^2 |A|)$ (Agarwal et al., 2022). Thus, the computational complexity of performing a single step of value iteration with a VE partial model $m_{VEP} \in M_{VEP}$ would be $O(|F_{VEP}|^2 |A|)$. This implies that compared to planning with regular models, planning with VE partial models would provide a significant computational complexity benefit and this benefit would be maximized when the model used for planning is a minimal VE partial model.

Moving on to the sample complexity benefits, previous studies of Kearns & Singh (1998) […] $m = O\!\left(\frac{|F_{VEP}| |A|}{(1-\gamma)^4 \varepsilon^2}\right)$, and let $Q^{k}_{\hat{m}_{VEP}}$ be the value returned by Q-value iteration at the $k$th epoch. Then, with probability greater than $1 - \delta$, the following holds for all $f \in F_{VEP}$ and $a \in A$: $\left\lVert Q^{k}_{\hat{m}_{VEP}} - Q^{*}_{m_{VEP}} \right\rVert_\infty \le \varepsilon$, where $k = \frac{\log(\varepsilon(1-\gamma))}{\log \gamma}$ and $Q^{*}_{m_{VEP}}$ is the optimal action value function in $m_{VEP}$.

Theorem 3 implies that compared to a regular model, a VE partial model is likely to require fewer samples in obtaining an $\varepsilon$ estimation of the optimal action value function through the use of Q-value iteration with a generative model, and the number of samples required is likely to be minimized when the VE partial model is a minimal one.
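The scaling statements in this section can be turned into back-of-the-envelope arithmetic. The helper functions below are a sketch: their names are hypothetical, constant factors hidden by the $O(\cdot)$ notation are dropped (so only ratios between models are meaningful), and rounding $k$ up to an integer is an added assumption.

```python
import math


def vi_sweep_cost(n_feature_vecs, n_actions):
    """Order of operations for one value-iteration step: |F|^2 |A|."""
    return n_feature_vecs ** 2 * n_actions


def q_iteration_epochs(eps, gamma):
    """k = log(eps * (1 - gamma)) / log(gamma), rounded up to an integer."""
    return math.ceil(math.log(eps * (1 - gamma)) / math.log(gamma))


def sample_size_order(n_feature_vecs, n_actions, eps, gamma):
    """Order of the sample size: |F| |A| / ((1 - gamma)^4 eps^2)."""
    return n_feature_vecs * n_actions / ((1 - gamma) ** 4 * eps ** 2)


# Hypothetical sizes: the partial model has 10x fewer feature vectors.
compute_ratio = vi_sweep_cost(1000, 3) / vi_sweep_cost(100, 3)
sample_ratio = (sample_size_order(1000, 3, 0.1, 0.9)
                / sample_size_order(100, 3, 0.1, 0.9))
epochs = q_iteration_epochs(eps=0.1, gamma=0.9)
```

With these numbers, a tenfold reduction in the number of feature vectors reduces the per-step cost of value iteration by a factor of 100 and the order of the required sample size by a factor of 10, while the number of epochs depends only on $\varepsilon$ and $\gamma$.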

5. EXPERIMENTAL RESULTS

We start this section by performing experiments to demonstrate the scalability advantages of minimal VE partial models, which are illustrations of the theoretical results derived in Sec. 4, and then we perform experiments to demonstrate the robustness advantages of these models. The details of our experiments can be found in App. C.

Environments. We perform experiments on both the SW environment (see Fig. 1) and on variations of the Two Rooms Dynamic Obstacles (2RDO) environment that are built on top of Minigrid (Chevalier-Boisvert et al., 2018) (see Fig. 2), as these environments allow for designing controlled experiments that are helpful in answering the questions of interest to this study. Some of the details of the SW environment are already presented in Sec. 3 and we refer the reader to App. C for more details. In the 2RDO environments, the agent, depicted by the red triangle, spawns in the top-left of the top room and has to navigate to the green goal cell located in the bottom-right of the same room, regardless of the gaseous motions of the obstacles in the bottom room. At each time step, the agent receives an image of the current state of the grid and then, through the use of a learned state encoder, transforms this image into a feature vector. Based on this, the agent selects an action that either turns it left or right, or moves it forward. If the agent successfully navigates to the goal cell, it receives a reward of +1 and the episode terminates. More details on the 2RDO environments can be found in App. C as well.

5.1. SCALABILITY EXPERIMENTS

For our scalability experiments, we perform experiments with several non-VE ($m_1$, $m_2$, $m_3$) and VE ($m_4$, $m_5$, $m_6$) partial models of both the deterministic and stochastic versions of the SW environment, referred to as Det-SW and Stoch-SW, respectively. The details of these models can be found in Table 1. For all of our experiments, we use value iteration as our planning algorithm.

Question 1. Do minimal VE partial models allow for planning with no value loss? In Sec. 4.1, we argued that by planning with a (non-minimal or minimal) VE partial model, an agent would incur no value loss compared to planning with the true environment itself. To empirically verify this, we present the agent with a set of non-VE partial models $m_1$, $m_2$, $m_3$ and a minimal VE partial model $m_4$, and compare the value losses on both the Det-SW and Stoch-SW environments. Results are shown in Fig. 3a. We can indeed see that while the VE partial model incurs no value loss, the non-VE ones do incur serious value losses.

Question 2. Do minimal VE partial models allow for planning with less planning loss? In Sec. 4.1, we argued that given a fixed amount of data, compared to a regular model, a VE partial model is likely to incur less planning loss, and this loss is likely to be minimized when the VE partial model is a minimal one. For empirical verification, we compare the planning losses of a minimal VE partial model $m_4$, two (non-minimal) VE partial models $m_5$ and $m_6$, and a regular model $m_7$, across dataset sizes of 3, 5, 10 and 20, which correspond to the number of samples for each $(f, a)$ pair, on the Stoch-SW environment. Results in Fig. 3b show that, as expected, VE partial models indeed incur lower planning losses than regular models, and the minimal VE partial model incurs the lowest planning loss.

Question 3. Do minimal VE partial models provide computational complexity benefits? In Sec. 4.2, we argued that compared to regular models, planning with VE partial models would provide a significant computational complexity benefit and this benefit would be maximized when the model used for planning is a minimal VE partial model. To empirically verify this, we present the agent with a minimal VE partial model $m_4$, two VE partial models $m_5$ and $m_6$, and a regular model $m_7$ of the Det-SW environment, and compare the average time it takes to perform a single step of value iteration for each of these models. Results are shown in Fig. 3c. As can be seen, planning with VE partial models indeed provides significant computational complexity benefits, and this benefit is maximized when the VE partial model is a minimal one.

Question 4. Do minimal VE partial models provide sample complexity benefits? In Sec. 4.2, we argued that compared to regular models, planning with VE partial models is likely to provide a sample complexity benefit and this benefit is likely to be maximized when the model that is used for planning is a minimal VE partial model. For empirical verification, we present the agent with a minimal VE partial model $m_4$ and with a regular model $m_7$ as generative models, and compare the sample efficiencies, as a result of performing Q-value iteration, on the Det-SW and Stoch-SW environments. In these experiments, after every episodic interaction, the agent updates its model with the collected trajectory, and then performs Q-value iteration until convergence. Results in Fig. 4 show that, as expected, planning with minimal VE partial models indeed provides significant sample efficiency benefits compared to planning with regular models.
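The model-update step used in these experiments (updating the model with the collected trajectory) can be sketched as a count-based estimator over projected feature vectors. This is a minimal illustration, not the paper's implementation: it assumes the indices of the modeled features are given by hand, as in the models of Table 1, and the function names and toy transitions are hypothetical.

```python
from collections import Counter, defaultdict


def project(f, keep):
    """Keep only the features whose indices are listed in `keep`."""
    return tuple(f[i] for i in keep)


def fit_partial_model(transitions, keep):
    """Maximum-likelihood tabular model over the projected feature space.

    `transitions` is an iterable of (f, a, r, f_next) tuples in which f and
    f_next are full feature tuples.
    """
    counts = defaultdict(Counter)
    reward_sum = defaultdict(float)
    for f, a, r, f_next in transitions:
        key = (project(f, keep), a)
        counts[key][project(f_next, keep)] += 1
        reward_sum[key] += r
    p_hat = {key: {f2: c / sum(cnt.values()) for f2, c in cnt.items()}
             for key, cnt in counts.items()}
    r_hat = {key: reward_sum[key] / sum(cnt.values())
             for key, cnt in counts.items()}
    return p_hat, r_hat


# Hypothetical data with features (position, weather); weather is irrelevant.
transitions = [
    ((0, 0), 1, 0.0, (1, 1)),
    ((0, 1), 1, 0.0, (1, 0)),
    ((1, 0), 1, 1.0, (2, 0)),
    ((1, 1), 1, 1.0, (2, 1)),
]
p_part, r_part = fit_partial_model(transitions, keep=(0,))    # position only
p_full, r_full = fit_partial_model(transitions, keep=(0, 1))  # every feature
```

With the same four transitions, the position-only model pools two samples into each of its two $(f, a)$ pairs, whereas the full model spreads them over four pairs with one sample each. This pooling is the mechanism behind the sample efficiency benefit discussed above.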

5.2. ROBUSTNESS EXPERIMENTS

For our robustness experiments, we perform experiments on variations of the 2RDO environment with grid sizes of 8x8 and 16x16. For convenience, we will refer to these environments with their grid size followed by their obstacle type. For example, we will refer to the 8x8 2RDO environment with red balls as 8x8 RedBalls (see Fig. 2). For all of our experiments, we use the straightforward decision-time planning algorithm of Zhao et al. (2021) (see Alg. 2), whose details can be found in App. C. As this algorithm makes use of neural networks, before moving on to the robustness experiments, we try to answer the following question.

Question 5. How to learn minimal VE partial models with deep learning architectures? So far, for illustration purposes, we have only performed experiments in which we had direct control over the features of the agent's model (see the models in Table 1). However, in realistic scenarios, the agent would have to come up on its own with a set of features to build a model of only the relevant aspects of its environment. A very popular way of letting the agent come up with its own features is to use neural networks in the representation of the agent's encoder, value estimator and model, and then to train it end-to-end on the environment of interest. However, in order for the agent to come up with only the relevant features, it has to be trained with the right inductive biases. Even though finding the right inductive biases to train a model-free or model-based RL agent is still an open problem in the representation learning literature (Bengio et al., 2013), in this study, we propose two inductive biases that are likely to guide the agent in coming up with only the relevant features. The first one is to only let the value estimator shape the encoder and prevent the model from doing so (see Fig. 7). In this way, the agent can be guided in learning the features that are relevant for predicting the right values in the environment.
And, the second one is to train the agent across a variety of environments in which the irrelevant aspects keep changing and the relevant ones stay the same. In this way, the agent can be guided in not learning the irrelevant aspects of the environment. In order to test the usefulness of these two inductive biases in coming up with only the relevant features of the environment, we compare three different agents: (i) a regular agent, A REG , that was trained on the 8x8 BlueBalls environment and whose encoder was jointly shaped by its value estimator and model, (ii) an agent, A VES , that was again trained on the 8x8 BlueBalls environment, but whose encoder was only shaped by its value estimator, and (iii) an agent, A VES+ME , that was trained on the 8x8 BlueBalls, GreenBalls, PurpleBalls and YellowBalls environments and whose encoder was only shaped by its value estimator. We compare these agents on the 8x8 BlueBalls and NoObstacles environments. If the agent is successful in coming up with only the relevant features of the environment, which are the positions of the agent and the goal, and not the positions and motions of the obstacles, we would expect it to perform similarly on the 8x8 BlueBalls and 8x8 NoObstacles environments. Results are shown in Fig. 5a & 5b. As can be seen, even though all of the agents perform well on the 8x8 BlueBalls environment, the A REG agent completely fails on the 8x8 NoObstacles environment, demonstrating that without the necessary inductive biases an agent is not capable of coming up with only the relevant features itself. We can also see that the A VES agent achieves a better performance than the A REG agent and that the A VES+ME agent achieves an even better performance than the A VES agent, demonstrating the usefulness of our proposed inductive biases in inducing models that display the behavior of minimal VE partial models. 
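The first inductive bias, letting only the value estimator shape the encoder, amounts to blocking the gradient of the model loss before it reaches the encoder parameters. The sketch below shows this with hand-written gradients for a deliberately tiny linear agent. The scalar feature, the squared losses and all names are assumptions made for illustration; this is not the architecture of Fig. 7 or of Zhao et al. (2021).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gradients(w, v, m, x, x_next, q_target, value_only=True):
    """Gradients of a toy agent with a scalar learned feature.

    encoder:     z = w . x
    value head:  q = v * z,            value loss = (q - q_target)^2
    model head:  z_pred = m * z,       model loss = (z_pred - z_next)^2

    With value_only=True the encoder output is treated as a constant inside
    the model loss (a stop-gradient), so only the value loss shapes `w`.
    """
    z, z_next = dot(w, x), dot(w, x_next)
    value_err = v * z - q_target
    model_err = m * z - z_next
    # The encoder always receives the gradient of the value loss.
    grad_w = [2 * value_err * v * xi for xi in x]
    if not value_only:
        # Joint shaping: the model loss also back-propagates into the encoder.
        grad_w = [g + 2 * model_err * (m * xi - xni)
                  for g, xi, xni in zip(grad_w, x, x_next)]
    grad_v = 2 * value_err * z  # value head gradient
    grad_m = 2 * model_err * z  # model head is trained in both settings
    return grad_w, grad_v, grad_m


w, v, m = [1.0, 0.5], 2.0, 0.5
x, x_next, q_target = [1.0, 2.0], [0.0, 1.0], 1.0
gw_value_only, gv, gm = gradients(w, v, m, x, x_next, q_target, value_only=True)
gw_joint, _, gm_joint = gradients(w, v, m, x, x_next, q_target, value_only=False)
```

In both settings the model head still receives its own gradient, so the model keeps learning to predict in the feature space; the only difference is whether its loss is allowed to influence which features the encoder extracts. In an automatic differentiation framework the same effect is usually obtained with a stop-gradient or detach operation on the encoder output fed to the model.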
In order to test the scalability of our results, we have also performed the same experiments with 16x16 versions of the environments. As can be seen in Fig. 5c & 5d, we obtain similar results.

Question 6. Can minimal VE partial models be useful for performing robust transfer? As minimal VE partial models only model the relevant aspects of the environment, we would expect them to be robust to the distribution shifts happening in the irrelevant aspects of the environment. In order to test this, we compare the performances of the A REG, A VES and A VES+ME agents on the 8x8 and 16x16 RedBalls, GreyBalls, RedBoxes and GreyBoxes environments. Results are shown in Fig. 5e-5l. As can be seen, while the A REG agent fails and the A VES agent only shows signs of robust transfer, the A VES+ME agent is able to perform robust transfer without any problem. These results illustrate the ability of minimal VE partial models to perform robust transfer.

Question 7. Are minimal VE partial models more robust to compounding model errors? As minimal VE partial models only model the relevant aspects of the environment, compared to regular models, we would expect them to be less susceptible to compounding model errors during planning. In order to test this, we compare the performances of the A REG and A VES+ME agents with search budgets of 20, 40 and 80 on the 16x16 BlueBalls environment. Note that this environment has been seen before by both of the agents. Results in Fig. 6 show that while the performance of the A REG agent drops significantly with the increase in the search budget, the performance of the A VES+ME agent stays close to optimal, demonstrating the robustness of minimal VE partial models to compounding model errors.

6. RELATED WORK

Partial Models. In the context of RL, the initial studies of partial models can be dated back to the seminal study of Talvitie & Singh (2008), which proposes to learn several models of uncontrolled dynamical systems that are partial at the observation level.
In contrast, we propose to learn a single and useful partial model of a controlled dynamical system that is partial at the feature level, which provides several advantages such as eliminating the question of how to combine the learned models, using them for control purposes, and making them compatible with function approximation. Our work also has a very close connection to the study of Zhao et al. (2021), which proposes a transformer-based deep model-based agent that dynamically attends to relevant parts of its state representation during planning. However, our work differs in that we propose the general concept of partial models for LRL that is independent of the agent's implementation details. Lastly, another related line of research is the work of Khetarpal et al. (2020; 2021) on affordances, which focuses on building models that are partial in the action space. Our study is complementary to these studies in that they can still leverage (non-minimal or minimal) VE partial models to reduce the size of the feature space and further increase the benefits of performing model-based RL with partial models.

Value-Equivalence. A recent trend in model-based RL is to learn models that are specifically useful for value-based planning (see e.g., Silver et al., 2017; Oh et al., 2017; Farquhar et al., 2017; Schrittwieser et al., 2020; Grimm et al., 2020; 2021). Even though our work also advocates the idea that models should be useful in value-based planning, our work differs in that we also argue that the explicit partiality of the models can provide significant scalability and robustness benefits when performing model-based RL in LRL scenarios.

Planning in Learned Feature Spaces. Even though there have been recent studies on the effect of introducing irrelevant features in the agent's learned representation (Efroni et al., 2022a; b), our study differs in that we are mainly interested in LRL environments in which the environment mostly consists of irrelevant features and the features relevant to the agent do not change over time. Our work is also different from the studies that learn models through self-supervised learning (see e.g., Sekar et al., 2020) in that we explicitly study the structure of the learned representation as having relevant and irrelevant components.

7. CONCLUSION AND DISCUSSION

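In conclusion, in this study, we have introduced special types of models, called minimal VE partial models, that only model the relevant aspects of the environment and are particularly useful in LRL scenarios. Our theoretical results suggest that these models can provide significant advantages in the value and planning losses that are incurred during planning and in the computational and sample complexity of planning. Our empirical results (i) validate our theoretical results and show that these models can scale to large environments, and (ii) show that these models can be robust to distribution shifts and compounding model errors. Overall, our findings suggest that minimal VE partial models can provide significant advantages in performing model-based RL in LRL scenarios.

One limitation of our work is that, rather than providing a principled method, we have only provided several heuristics for training deep RL agents that can come up with only the relevant features of the environment. However, we note that this is mainly due to the lack of principled approaches in the representation learning literature, and we believe that this limitation can be overcome as more principled approaches are introduced in the literature. We hope to tackle this limitation in future work.

Another important limitation is that, due to the need to perform illustrative and controlled experiments, we have only performed experiments in the SW and 2RDO environments, where there is just a single task rather than a sequence of tasks, each requiring a model of the same relevant features, that unfolds over time. However, experiments in more environments that have this sequential nature can be helpful in further validating the advantages of minimal VE partial models in LRL scenarios, which we also hope to tackle in future work.

A PROOFS

Theorem 1. Let $m_{\mathrm{VEP}} \in \mathcal{M}_{\mathrm{VEP}}$ be a VE partial model of the true environment $m^* \in \mathcal{M}$. Then, the value loss between an optimal policy in $m^*$, $\pi^*$, and an optimal policy in $m_{\mathrm{VEP}}$, $\pi^*_{m_{\mathrm{VEP}}}$, is given by:

$$\left\| V^*_{m^*} - V^{\pi^*_{m_{\mathrm{VEP}}}}_{m^*} \right\|_\infty = 0.$$

Proof. This result directly follows from Defn. 3. Recall that, according to Defn. 3, we have

$$V^{\pi^*_{m_{\mathrm{VEP}}}}_{m^*} = V^*_{m^*} \quad \forall \pi^*_{m_{\mathrm{VEP}}} \in \Pi_{\mathrm{VEP}},$$

which implies

$$\left\| V^*_{m^*} - V^{\pi^*_{m_{\mathrm{VEP}}}}_{m^*} \right\|_\infty = 0 \quad \forall \pi^*_{m_{\mathrm{VEP}}} \in \Pi_{\mathrm{VEP}}.$$

Theorem 2. Let $m_{\mathrm{VEP}} \in \mathcal{M}_{\mathrm{VEP}}$ be a VE partial model of the true environment $m^* \in \mathcal{M}$, and let $\hat{m}_{\mathrm{VEP}} \in \mathcal{M}_{\mathrm{VEP}}$ be a model that comprises the reward function of $m_{\mathrm{VEP}}$ and a transition distribution that is estimated from $n$ samples for each $(f, a)$ pair. Let $\Pi_{r_{\mathrm{VEP}}} \equiv \{\pi \mid \exists\, p_{\mathrm{VEP}} \text{ s.t. } \pi \text{ is optimal in } (p_{\mathrm{VEP}}, r_{\mathrm{VEP}})\}$. Then, certainty-equivalence planning with $\hat{m}_{\mathrm{VEP}}$ has planning loss

$$\left\| V^*_{m_{\mathrm{VEP}}} - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}} \right\|_\infty \le \frac{2 R_{\max}}{(1-\gamma)^2} \sqrt{\frac{1}{2n} \log \frac{2 |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| |\Pi_{r_{\mathrm{VEP}}}|}{\delta}},$$

with probability at least $1 - \delta$.

Proof. Similar to Jiang et al. (2015), we prove Theorem 2 with two lemmas: Lemma 1 translates planning loss to value error, and Lemma 2 relates value error to a Bellman-residual-like quantity that has a uniform deviation bound which depends on $|\Pi_{r_{\mathrm{VEP}}}|$.

Lemma 1. For any $\hat{m}_{\mathrm{VEP}} = (\hat{p}_{\mathrm{VEP}}, \hat{r}_{\mathrm{VEP}})$ with $\hat{r}_{\mathrm{VEP}}$ bounded by $[0, R_{\max}]$,

$$\left\| V^*_{m_{\mathrm{VEP}}} - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}} \right\|_\infty \le 2 \max_{\pi : \mathcal{F}_{\mathrm{VEP}} \to \mathcal{A}} \left\| V^{\pi}_{m_{\mathrm{VEP}}} - V^{\pi}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty. \tag{11}$$

In particular, if $\hat{r}_{\mathrm{VEP}} = r_{\mathrm{VEP}}$, we have

$$\left\| V^*_{m_{\mathrm{VEP}}} - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}} \right\|_\infty \le 2 \max_{\pi \in \Pi_{r_{\mathrm{VEP}}}} \left\| V^{\pi}_{m_{\mathrm{VEP}}} - V^{\pi}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty. \tag{12}$$

Proof. For all $f \in \mathcal{F}_{\mathrm{VEP}}$,

$$
\begin{aligned}
V^{\pi^*_{m_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}}(f) - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}}(f)
&= \left( V^{\pi^*_{m_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}}(f) - V^{\pi^*_{m_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f) \right) - \left( V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}}(f) - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f) \right) \\
&\quad + \left( V^{\pi^*_{m_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f) - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f) \right) \\
&\le \left( V^{\pi^*_{m_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}}(f) - V^{\pi^*_{m_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f) \right) - \left( V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}}(f) - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f) \right) \\
&\le 2 \max_{\pi \in \{\pi^*_{m_{\mathrm{VEP}}},\, \pi^*_{\hat{m}_{\mathrm{VEP}}}\}} \left| V^{\pi}_{m_{\mathrm{VEP}}}(f) - V^{\pi}_{\hat{m}_{\mathrm{VEP}}}(f) \right|,
\end{aligned}
$$

where the first inequality holds because $\pi^*_{\hat{m}_{\mathrm{VEP}}}$ is optimal in $\hat{m}_{\mathrm{VEP}}$, so the last bracket of the first equality is non-positive. Eqn. 11 follows from taking the max over all feature vectors on both sides of the inequality and noticing that the set of all policies is a trivial superset of $\{\pi^*_{m_{\mathrm{VEP}}}, \pi^*_{\hat{m}_{\mathrm{VEP}}}\}$. If $\hat{r}_{\mathrm{VEP}} = r_{\mathrm{VEP}}$, the bound can be tightened since $\pi^*_{m_{\mathrm{VEP}}}, \pi^*_{\hat{m}_{\mathrm{VEP}}} \in \Pi_{r_{\mathrm{VEP}}}$, and Eqn. 12 follows.

Lemma 2. For any $\hat{m}_{\mathrm{VEP}} = (\hat{p}_{\mathrm{VEP}}, \hat{r}_{\mathrm{VEP}})$ with $\hat{r}_{\mathrm{VEP}}$ bounded by $[0, R_{\max}]$, and for all $\pi : \mathcal{F}_{\mathrm{VEP}} \to \mathcal{A}$,

$$\left\| Q^{\pi}_{m_{\mathrm{VEP}}} - Q^{\pi}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty \le \frac{1}{1-\gamma} \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \left| \hat{r}_{\mathrm{VEP}}(f, a) + \gamma \left\langle \hat{p}_{\mathrm{VEP}}(f, a, \cdot),\, V^{\pi}_{m_{\mathrm{VEP}}} \right\rangle - Q^{\pi}_{m_{\mathrm{VEP}}}(f, a) \right|.$$

Proof. Given any policy $\pi$, define action-value functions $Q_0, Q_1, \ldots, Q_n, \ldots$ such that $Q_0 = Q^{\pi}_{m_{\mathrm{VEP}}}$ and

$$Q_n(f, a) = \hat{r}_{\mathrm{VEP}}(f, a) + \gamma \left\langle \hat{p}_{\mathrm{VEP}}(f, a, \cdot),\, V_{n-1} \right\rangle,$$

where $V_{n-1}(f) = Q_{n-1}(f, \pi(f))$. Notice that

$$
\begin{aligned}
\| Q_n - Q_{n-1} \|_\infty
&= \gamma \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \left| \left\langle \hat{p}_{\mathrm{VEP}}(f, a, \cdot),\, V_{n-1} - V_{n-2} \right\rangle \right| \\
&\le \gamma \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \| \hat{p}_{\mathrm{VEP}}(f, a, \cdot) \|_1 \, \| V_{n-1} - V_{n-2} \|_\infty \\
&= \gamma \| V_{n-1} - V_{n-2} \|_\infty \\
&\le \gamma \| Q_{n-1} - Q_{n-2} \|_\infty,
\end{aligned}
$$

so

$$\| Q_n - Q_0 \|_\infty \le \sum_{k=0}^{n-1} \| Q_{k+1} - Q_k \|_\infty \le \| Q_1 - Q_0 \|_\infty \sum_{k=0}^{n-1} \gamma^{k}.$$

Taking the limit $n \to \infty$, $Q_n \to Q^{\pi}_{\hat{m}_{\mathrm{VEP}}}$, and we have

$$\left\| Q^{\pi}_{\hat{m}_{\mathrm{VEP}}} - Q_0 \right\|_\infty \le \frac{1}{1-\gamma} \| Q_1 - Q_0 \|_\infty.$$

This completes the proof, noticing that $Q_0 = Q^{\pi}_{m_{\mathrm{VEP}}}$, $V_0 = V^{\pi}_{m_{\mathrm{VEP}}}$, and $Q_1(f, a) = \hat{r}_{\mathrm{VEP}}(f, a) + \gamma \langle \hat{p}_{\mathrm{VEP}}(f, a, \cdot), V^{\pi}_{m_{\mathrm{VEP}}} \rangle$.

From Eqn. 12 in Lemma 1 and Lemma 2, we have

$$
\begin{aligned}
\left\| V^*_{m_{\mathrm{VEP}}} - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{m_{\mathrm{VEP}}} \right\|_\infty
&\le 2 \max_{\pi \in \Pi_{r_{\mathrm{VEP}}}} \left\| V^{\pi}_{m_{\mathrm{VEP}}} - V^{\pi}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty \\
&\le 2 \max_{\pi \in \Pi_{r_{\mathrm{VEP}}}} \left\| Q^{\pi}_{m_{\mathrm{VEP}}} - Q^{\pi}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty \\
&= 2 \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A},\, \pi \in \Pi_{r_{\mathrm{VEP}}}} \left| Q^{\pi}_{m_{\mathrm{VEP}}}(f, a) - Q^{\pi}_{\hat{m}_{\mathrm{VEP}}}(f, a) \right| \\
&\le \frac{2}{1-\gamma} \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A},\, \pi \in \Pi_{r_{\mathrm{VEP}}}} \left| r_{\mathrm{VEP}}(f, a) + \gamma \left\langle \hat{p}_{\mathrm{VEP}}(f, a, \cdot),\, V^{\pi}_{m_{\mathrm{VEP}}} \right\rangle - Q^{\pi}_{m_{\mathrm{VEP}}}(f, a) \right|.
\end{aligned}
$$

For any particular $(f, a, \pi)$ tuple, according to Hoeffding's inequality, for all $t > 0$,

$$\Pr\left( \left| r_{\mathrm{VEP}}(f, a) + \gamma \left\langle \hat{p}_{\mathrm{VEP}}(f, a, \cdot),\, V^{\pi}_{m_{\mathrm{VEP}}} \right\rangle - Q^{\pi}_{m_{\mathrm{VEP}}}(f, a) \right| > t \right) \le 2 \exp\left( - \frac{2 n t^2}{R_{\max}^2 / (1-\gamma)^2} \right), \tag{29}$$

as $r_{\mathrm{VEP}}(f, a) + \gamma \langle \hat{p}_{\mathrm{VEP}}(f, a, \cdot), V^{\pi}_{m_{\mathrm{VEP}}} \rangle$ is the average of i.i.d. samples bounded in $[0, R_{\max}/(1-\gamma)]$, with mean $Q^{\pi}_{m_{\mathrm{VEP}}}(f, a)$. To obtain a uniform bound over all $(f, a, \pi)$ tuples, we set the right-hand side of Eqn. 29 to $\delta / (|\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| |\Pi_{r_{\mathrm{VEP}}}|)$ and solve for $t$, and the theorem follows.

Theorem 3. Let $m_{\mathrm{VEP}} \in \mathcal{M}_{\mathrm{VEP}}$ be a VE partial model of the true environment $m^* \in \mathcal{M}$. Let $\hat{m}_{\mathrm{VEP}} \in \mathcal{M}_{\mathrm{VEP}}$ be the corresponding approximate VE partial model that has the same reward function as $m_{\mathrm{VEP}}$, but whose transition distribution is estimated by $m$ calls to the generative model $m_{\mathrm{VEP}}$, where

$$m = O\left( \frac{|\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}|}{(1-\gamma)^4 \varepsilon^2} \right),$$

and let $Q^{k}_{\hat{m}_{\mathrm{VEP}}}$ be the value returned by Q-value iteration at the $k$th epoch.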
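Then, with probability greater than $1 - \delta$, the following holds for all $f \in \mathcal{F}_{\mathrm{VEP}}$ and $a \in \mathcal{A}$:

$$\left\| Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right\|_\infty \le \varepsilon,$$

where $k = \frac{\log(\varepsilon (1-\gamma))}{\log \gamma}$ and $Q^{*}_{m_{\mathrm{VEP}}}$ is the optimal action-value function in $m_{\mathrm{VEP}}$.

Proof. Before starting the proof, let us first define generative models. A generative model, or a sampler, is a model that can provide us with samples $f' \sim p_{\mathrm{VEP}}(f, a, \cdot)$ for all $f \in \mathcal{F}_{\mathrm{VEP}}$ and $a \in \mathcal{A}$. Now, let us assume we have access to a generative model $m_{\mathrm{VEP}}$ and suppose we call this model $N$ times at each $(f, a)$ pair. Let $\hat{p}_{\mathrm{VEP}}$ be the transition distribution of our empirical model, defined as follows:

$$\hat{p}_{\mathrm{VEP}}(f, a, f') = \frac{\mathrm{count}(f, a, f')}{N} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\{ f'_i = f' \},$$

where $f'_i \sim p_{\mathrm{VEP}}(f, a, \cdot)$ for all $i \in \{1, \ldots, N\}$, and $\mathrm{count}(f, a, f')$ is the number of times the pair $(f, a)$ transitions to $f'$.

Moving on to the main proof, by adding and subtracting $Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}$, we can rewrite $Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}}$ as follows:

$$Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} = \underbrace{Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}}_{\text{(i)}} + \underbrace{Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}}}_{\text{(ii)}}.$$

Bounding Term (i):

$$
\begin{aligned}
\left\| Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty
&= \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \left| \left( r_{\mathrm{VEP}}(f, a) + \gamma\, \hat{p}_{\mathrm{VEP}} V^{k-1}_{\hat{m}_{\mathrm{VEP}}}(f, a) \right) - \left( r_{\mathrm{VEP}}(f, a) + \gamma\, \hat{p}_{\mathrm{VEP}} V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f, a) \right) \right| \\
&= \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \left| \gamma\, \hat{p}_{\mathrm{VEP}} \left( V^{k-1}_{\hat{m}_{\mathrm{VEP}}} - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} \right)(f, a) \right| \\
&\le \gamma \left\| V^{k-1}_{\hat{m}_{\mathrm{VEP}}} - V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty \\
&= \gamma \max_{f \in \mathcal{F}_{\mathrm{VEP}}} \left| \max_{a \in \mathcal{A}} Q^{k-1}_{\hat{m}_{\mathrm{VEP}}}(f, a) - \max_{a \in \mathcal{A}} Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f, a) \right| \\
&\le \gamma \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \left| Q^{k-1}_{\hat{m}_{\mathrm{VEP}}}(f, a) - Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f, a) \right| \\
&= \gamma \left\| Q^{k-1}_{\hat{m}_{\mathrm{VEP}}} - Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty.
\end{aligned}
$$

Unrolling the last inequality $k$ times, we obtain:

$$\left\| Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty \le \gamma^{k} \left\| Q^{0}_{\hat{m}_{\mathrm{VEP}}} - Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty \le \frac{\gamma^{k}}{1-\gamma}.$$

Bounding Term (ii):

$$
\begin{aligned}
\left( Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right)(f, a)
&= \gamma\, \hat{p}_{\mathrm{VEP}} V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f, a) - \gamma\, p_{\mathrm{VEP}} V^{*}_{m_{\mathrm{VEP}}}(f, a) \\
&= \gamma \left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}}(f, a) + \gamma\, \hat{p}_{\mathrm{VEP}} \left( V^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} - V^{*}_{m_{\mathrm{VEP}}} \right)(f, a) \\
&= \gamma \left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}}(f, a) \\
&\quad + \gamma \sum_{f' \in \mathcal{F}_{\mathrm{VEP}}} \hat{p}_{\mathrm{VEP}}(f, a, f') \left( \max_{a' \in \mathcal{A}} Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}}(f', a') - \max_{a' \in \mathcal{A}} Q^{*}_{m_{\mathrm{VEP}}}(f', a') \right).
\end{aligned}
$$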
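Therefore,

$$
\begin{aligned}
\left\| Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right\|_\infty
&\le \gamma \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \left| \left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}}(f, a) \right| + \gamma \left\| Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right\|_\infty,
\end{aligned}
$$

which, after rearranging, gives

$$\left\| Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right\|_\infty \le \frac{\gamma}{1-\gamma} \left\| \left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}} \right\|_\infty.$$

Fix an $(f, a)$ pair:

$$
\begin{aligned}
\left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}}(f, a)
&= \frac{1}{N} \sum_{i=1}^{N} V^{*}_{m_{\mathrm{VEP}}}(f'_i) - \mathbb{E}_{f' \sim p_{\mathrm{VEP}}(f, a, \cdot)} \left[ V^{*}_{m_{\mathrm{VEP}}}(f') \right] \\
&= \frac{1}{N} \left( S_N - \mathbb{E}[S_N] \right),
\end{aligned}
$$

where $S_N = \sum_{i=1}^{N} X_i$ and $X_i = V^{*}_{m_{\mathrm{VEP}}}(f'_i)$. The $X_i$ are independent random variables and $|X_i| \le \frac{1}{1-\gamma}$. Applying Hoeffding's inequality, we obtain for all $t > 0$:

$$\Pr\left( \left| \frac{1}{N} \left( S_N - \mathbb{E}[S_N] \right) \right| \ge t \right) \le 2 \exp\left( - \frac{N^2 t^2}{N / (1-\gamma)^2} \right) = 2 \exp\left( - N t^2 (1-\gamma)^2 \right).$$

By the union bound,

$$
\begin{aligned}
\Pr\left( \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \left| \left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}}(f, a) \right| \ge t \right)
&= \Pr\left( \exists (f, a) \text{ s.t. } \left| \left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}}(f, a) \right| \ge t \right) \\
&\le \sum_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \Pr\left( \left| \left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}}(f, a) \right| \ge t \right) \\
&\le 2 |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| \exp\left( - N t^2 (1-\gamma)^2 \right).
\end{aligned}
$$

Let the failure probability be $\delta > 0$ and solve for $t$:

$$2 |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| \exp\left( - N t^2 (1-\gamma)^2 \right) = \delta \;\Rightarrow\; t = \frac{1}{1-\gamma} \sqrt{\frac{\log(2 |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| / \delta)}{N}}.$$

With probability at least $1 - \delta$,

$$\left\| Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right\|_\infty \le \frac{\gamma}{1-\gamma} \max_{f \in \mathcal{F}_{\mathrm{VEP}},\, a \in \mathcal{A}} \left| \left( \hat{p}_{\mathrm{VEP}} - p_{\mathrm{VEP}} \right) V^{*}_{m_{\mathrm{VEP}}}(f, a) \right| \le \frac{\gamma}{(1-\gamma)^2} \sqrt{\frac{\log(2 |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| / \delta)}{N}}.$$

We conclude

$$
\begin{aligned}
\left\| Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right\|_\infty
&\le \left\| Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} \right\|_\infty + \left\| Q^{\pi^*_{\hat{m}_{\mathrm{VEP}}}}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right\|_\infty \\
&\le \frac{\gamma^{k}}{1-\gamma} + \frac{\gamma}{(1-\gamma)^2} \sqrt{\frac{\log(2 |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| / \delta)}{N}}.
\end{aligned}
$$

By choosing $k = \frac{\log(\varepsilon (1-\gamma) / 2)}{\log \gamma}$ and $N = \frac{4 \gamma^2}{(1-\gamma)^4 \varepsilon^2} \log(2 |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| / \delta)$, each of the two terms equals $\varepsilon / 2$, and we get

$$\left\| Q^{k}_{\hat{m}_{\mathrm{VEP}}} - Q^{*}_{m_{\mathrm{VEP}}} \right\|_\infty \le \varepsilon/2 + \varepsilon/2 = \varepsilon.$$

Therefore, the total number of samples (calls to the generative model) needed to get an $\varepsilon$-accurate estimate of the optimal Q-value is, up to the logarithmic factor $\log(2 |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| / \delta)$:

$$N |\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}| = O\left( \frac{|\mathcal{F}_{\mathrm{VEP}}| |\mathcal{A}|}{(1-\gamma)^4 \varepsilon^2} \right).$$

B ALGORITHM PSEUDOCODES

Algorithm 1 Model-Based Q-Value Iteration
 1: Initialize the parameters V_0 = 0 and Q_0 = 0
 2: for episode k = 1, ..., K do
 3:     for (f, a) ∈ F × A do
 4:         Q_k(f, a) = r(f, a) + γ p V_{k-1}(f, a)
 5:         V_k(f) = max_{a ∈ A} Q_k(f, a)
 6:     end for
 7: end for
 8: Return Q_K

Algorithm 2 The Straightforward Decision-Time Planning Algorithm of Zhao et al. (2021)
 1: Initialize the parameters θ, η & ω of ϕ_θ : S → F, Q_η : F × A → R & m_ω = (p_ω, r_ω)
 2: Initialize the replay buffer B ← {}
 3: N_ple ← number of episodes to perform planning and learning
 4: N_rbt ← number of samples that the replay buffer must hold to perform planning and learning
 5: n_s ← number of time steps to perform search
 6: n_bs ← number of samples to sample from the replay buffer
 7: h ← search heuristic
 8: T ← replay buffer sampling strategy
 9: i ← 0
10: while i < N_ple do
11:     S ← reset environment
12:     while not done do
13:         A ← ϵ-greedy(tree_search_with_bootstrapping(ϕ_θ(S), m_ω, Q_η, n_s, h))
14:         R, S′, done ← environment(A)
15:         B ← B + {(S, A, R, S′, done)}
16:         if |B| ≥ N_rbt then
17:             D ← sample_batch(B, n_bs, T)
18:             Update ϕ_θ, Q_η & m_ω with D
19:         end if
20:         S ← S′
21:     end while
22:     i ← i + 1
23: end while
24: Return ϕ_θ, Q_η & m_ω

Note that Alg. 2 does not employ the "bottleneck mechanism" introduced in Zhao et al. (2021).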
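As a rough illustration of Algorithm 1 and of the quantities chosen at the end of the proof of Theorem 3, the following Python sketch builds an empirical transition model from a sampler and runs Q-value iteration on it. The two-feature toy model, the function names, and the rounding of N and k up to integers are assumptions made for this example only; rewards are assumed to lie in [0, 1], as in the proof.

```python
import math


def theorem3_budget(num_features, num_actions, gamma, eps, delta):
    """Per-pair sample count N and iteration count k as chosen in the proof of
    Theorem 3 (rounded up to integers, which is an assumption of this sketch)."""
    log_term = math.log(2 * num_features * num_actions / delta)
    n_samples = math.ceil(4 * gamma**2 * log_term / ((1 - gamma) ** 4 * eps**2))
    k_iters = math.ceil(math.log(eps * (1 - gamma) / 2) / math.log(gamma))
    return n_samples, k_iters


def estimate_transitions(sampler, num_features, num_actions, n_samples):
    """Empirical model: p_hat[f][a][f'] = count(f, a, f') / N."""
    p_hat = [[[0.0] * num_features for _ in range(num_actions)]
             for _ in range(num_features)]
    for f in range(num_features):
        for a in range(num_actions):
            for _ in range(n_samples):
                p_hat[f][a][sampler(f, a)] += 1.0
            p_hat[f][a] = [count / n_samples for count in p_hat[f][a]]
    return p_hat


def q_value_iteration(p, r, gamma, n_iters):
    """Algorithm 1: Q_k(f, a) = r(f, a) + gamma * sum_f' p(f, a, f') V_{k-1}(f')."""
    num_features, num_actions = len(r), len(r[0])
    v = [0.0] * num_features
    q = [[0.0] * num_actions for _ in range(num_features)]
    for _ in range(n_iters):
        q = [[r[f][a] + gamma * sum(p[f][a][g] * v[g] for g in range(num_features))
              for a in range(num_actions)] for f in range(num_features)]
        v = [max(row) for row in q]
    return q


# Hypothetical toy model: action 0 stays, action 1 switches between the two
# feature vectors; any action taken in feature vector 1 yields reward 1.
GAMMA, EPS, DELTA = 0.5, 0.1, 0.1
rewards = [[0.0, 0.0], [1.0, 1.0]]


def toy_sampler(f, a):
    return f if a == 0 else 1 - f


n_samples, k_iters = theorem3_budget(2, 2, GAMMA, EPS, DELTA)
p_hat = estimate_transitions(toy_sampler, 2, 2, n_samples)
q_k = q_value_iteration(p_hat, rewards, GAMMA, k_iters)
```

Because the toy sampler is deterministic, the empirical model is exact here and only the iteration error (term (i) in the proof) remains; with a stochastic sampler the estimation error (term (ii)) would also contribute.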

C EXPERIMENTAL DETAILS

In this section, we provide the implementation details of the environments that are used in Sec. 5 together with the details of the models that are used in the scalability experiments of Sec. 5.1. We also provide the implementation details of the straightforward decision-time planning algorithm of Zhao et al. (2021) that was used in Sec. 5.2.

C.1 IMPLEMENTATION DETAILS OF THE SW ENVIRONMENT

As stated in Sec. 3, in the Squirrel's World (SW) environment the squirrel's job is to navigate from cell E1 (its initial state) to cell E16 (the terminal state) to pick up the nut without getting caught by the hawk that flies back and forth horizontally along row C. At each time step, the squirrel receives as input a 5×16 image of the current state of the environment and then, through the use of a predefined state encoder, transforms this image into a feature vector that contains information regarding all aspects of the current state of the environment, i.e., the feature vector contains information on the current position of the squirrel and the cloud, the current wind direction in rows A and B, the current position and direction of the hawk, and the current weather condition. Based on this, the squirrel selects an action that either moves it to the left or right cell or keeps its position fixed (if the squirrel tries to move out of the boundaries of the world, its position is kept constant). If the squirrel gets caught by the hawk or runs out of time, it receives a reward of 0 and the episode terminates, and if the squirrel successfully navigates to the nut within the given time limit, it gets a reward of +10 and the episode terminates. The agent-environment interaction lasts for 100 time steps, after which the agent receives a done signal, marking the end of the episode.
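To make the relation between the full feature vector and a partial one concrete, the sketch below represents the aspects listed above as a small Python dataclass and projects it onto the feature subset of model m4 from Table 1. The class name, the integer encodings, and the single `wind_direction` field are assumptions made for illustration; they are not taken from the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SWFeatures:
    """Aspects of an SW state listed in the text (integer encodings are assumed)."""
    squirrel_position: int
    cloud_position: int
    wind_direction: int
    hawk_position: int
    hawk_direction: int
    weather: int


# Feature subset of model m4 in Table 1.
M4_FEATURES = ("squirrel_position", "hawk_position", "hawk_direction")


def project(features, kept):
    """Return the partial feature vector that keeps only the named features."""
    return tuple(getattr(features, name) for name in kept)


# Two full feature vectors that differ only in cloud, wind and weather.
state_a = SWFeatures(3, 7, 0, 5, 1, 0)
state_b = SWFeatures(3, 2, 1, 5, 1, 1)
partial_a = project(state_a, M4_FEATURES)
partial_b = project(state_b, M4_FEATURES)
```

Both full feature vectors map to the same partial feature vector, so a model defined over the m4 features does not distinguish them.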

C.2 IMPLEMENTATION DETAILS OF THE 2RDO ENVIRONMENTS

In the 2RDO environments, the agent, depicted by the red triangle, spawns in the top-left of the top room and has to navigate to the green goal cell located in the bottom-right of the same room, regardless of the gaseous motions of the obstacles in the bottom room. Here, at each time step, each obstacle moves to one of its neighboring cells (if an obstacle tries to move out of the boundaries of the world, its position is kept constant). At each time step, the agent receives an image of the current state of the grid and then, through the use of a learned state encoder, transforms this image into a feature vector. Based on this, the agent selects an action that either turns it left or right, or moves it forward (if the agent tries to move out of the boundaries of the world, its position is kept constant). If the agent successfully navigates to the goal cell within the given time limit, it receives a reward of +1 and the episode terminates. The agent-environment interaction lasts for 50 time steps for the 8x8 environments and 100 time steps for the 16x16 environments, after which the agent receives a done signal, marking the end of the episode.
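The action rule described above can be sketched in a few lines of Python. The heading encoding, the coordinate convention, and the function name are assumptions for this example, and the sketch deliberately ignores the wall between the two rooms and the obstacles, so it only illustrates the turn, forward and boundary behaviour.

```python
# Headings as (row, col) offsets in clockwise order: 0 = right, 1 = down,
# 2 = left, 3 = up. This encoding is an assumption of the sketch.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]

# Episode time limits per grid size, as stated in the text.
TIME_LIMITS = {8: 50, 16: 100}


def step_agent(row, col, heading, action, grid_size):
    """Apply one of the three actions and return the new (row, col, heading)."""
    if action == "left":
        return row, col, (heading - 1) % 4
    if action == "right":
        return row, col, (heading + 1) % 4
    if action == "forward":
        d_row, d_col = HEADINGS[heading]
        new_row, new_col = row + d_row, col + d_col
        # Moving out of the world keeps the position constant.
        if 0 <= new_row < grid_size and 0 <= new_col < grid_size:
            return new_row, new_col, heading
        return row, col, heading
    raise ValueError(f"unknown action: {action}")
```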

C.3 DETAILS OF THE HAND-ENGINEERED MODELS

The details of what the models in Sec. 5.1 model can be found in Table 1.

Table 1: Several non-VE and VE partial models of the SW environment.

Model   Modeled aspects of the environment
m1      squirrel position, cloud position
m2      squirrel position, cloud position, wind direction
m3      squirrel position, cloud position, wind direction, hawk position
m4      squirrel position, hawk position, hawk direction
m5      squirrel position, hawk position, hawk direction, cloud position
m6      squirrel position, hawk position, hawk direction, cloud position, wind direction
m7      squirrel position, hawk position, hawk direction, cloud position, wind direction, weather

Note that $|\mathcal{F}_{\mathrm{VEP}}|$ also affects $|\Pi_{r_{\mathrm{VEP}}}|$, i.e., as $|\mathcal{F}_{\mathrm{VEP}}|$ grows, $|\Pi_{r_{\mathrm{VEP}}}|$ also grows.

See https://github.com/mila-iqia/Conscious-Planning for the publicly available actual code.

Figure 1: The SW environment.

Figure 2: Variations of the 2RDO environment with grid sizes of 8x8 and 16x16. In these environments, there are either no obstacles (c), or there are several obstacles (balls and boxes) with different colors (a, b, d, e).

Figure 3: The (a) value losses, (b) planning losses, and (c) planning times of several models. Plot (a) was obtained over a single run and plots (b) and (c) were obtained by averaging over 50 runs per model.

Figure 4: The total reward obtained as a result of planning with models m4 and m7 on the (a) Det-SW and (b) Stoch-SW environments. Shaded regions are standard errors over 50 runs.

Figure 5: The total steps to reach the goal in the 8x8 and 16x16 versions of the (a, c) BlueBalls, (b, d) NoObstacles, (e, i) RedBalls, (f, j) GreyBalls, (g, k) RedBoxes and (h, l) GreyBoxes environments for the AREG, AVES and AVES+ME agents. Black dashed lines indicate the performance of the optimal policy in the corresponding environments. Shaded regions are standard errors over 100 runs.

Figure 6: The total steps to reach the goal in the 16x16 BlueBalls environment for the AREG and AVES+ME agents with search budgets of 20, 40 and 80. Black dashed lines indicate the performance of the optimal policy in the corresponding environments. Shaded regions are standard errors over 100 runs.

Let $m_{\mathrm{VEP}} \in \mathcal{M}_{\mathrm{VEP}}$ be a VE partial model of the true environment $m^* \in \mathcal{M}$. Let $\hat{m}_{\mathrm{VEP}} \in \mathcal{M}_{\mathrm{VEP}}$ be the corresponding approximate VE partial model that has the same reward function as $m_{\mathrm{VEP}}$, but whose transition distribution is estimated by $m$ calls to the generative model $m_{\mathrm{VEP}}$, where

+ {(S, A, R, S′, done)}
16: if |B| ≥ N_rbt then
17: D ← sample_batch(B, n_bs, T)
18: Update ϕ_θ, Q_η & m_ω with D

C.4 DETAILS AND HYPERPARAMETERS OF THE DECISION-TIME PLANNING ALGORITHM

The details and hyperparameters of the straightforward decision-time planning algorithm of Zhao et al. (2021) that we have used can be found in Table 2.

$F$, let $F_P \subset F$ s.t. $|F_P| < |F|$. Let $\mathcal{F}_P$ be a space of feature vectors in which the feature vectors consist of the features in $F_P$. We say that a model $m_P$ is a partial model of the true environment $m^* \in \mathcal{M}$ if it is defined over the feature vector space $\mathcal{F}_P$, i.e., $m_P \in \mathcal{M}_P \equiv \{(p_P, r_P) \mid p_P :$

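that contains value-equivalent models of the true environment $m^* \in \mathcal{M}$. Given a set of features $F$, let $F_{\mathrm{VEP}} \subset F$ s.t. $|F_{\mathrm{VEP}}| < |F|$ and $F_R \subseteq F_{\mathrm{VEP}}$. Let $\mathcal{F}_{\mathrm{VEP}}$ be a space of feature vectors in which the feature vectors consist of the features in $F_{\mathrm{VEP}}$. Let $m_{\mathrm{VEP}}$ be a partial model that is defined over the feature vector space $\mathcal{F}_{\mathrm{VEP}}$, i.e., $m_{\mathrm{VEP}} \in \mathcal{M}_{\mathrm{VEP}} \equiv \{(p_{\mathrm{VEP}}, r_{\mathrm{VEP}}) \mid p_{\mathrm{VEP}} :$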

r_ω)
2: Initialize the replay buffer B ← {}
3: N_ple ← number of episodes to perform planning and learning
4: N_rbt ← number of samples that the replay buffer must hold to perform planning and learning
5: n_s ← number of time steps to perform search
6: n_bs ← number of samples to sample from the replay buffer
7: h ← search heuristic
8: T ← replay buffer sampling strategy
9: i ← 0
10: while i < N_ple do

Table 2: Details and hyperparameters of Alg. 2.


For more details (such as the NN architectures, replay buffer sizes, learning rates, and exact details of the tree search), we refer the reader to the publicly available code and the supplementary material of Zhao et al. (2021).

C.5 DETAILS OF THE ENCODER SHAPING PROCEDURE DURING TRAINING

In Sec. 5.2, we argued that one of the important inductive biases that is likely to guide the agent in coming up with only the relevant features of the environment is to only let the value estimator shape the encoder and to prevent the model from doing so. This is pictorially depicted in Fig. 7 . 
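The idea of letting only the value estimator shape the encoder can be illustrated with a scalar toy model whose gradients are written out by hand. Everything in this sketch (the scalar encoder `w`, value head `q`, model head `a`, the squared losses, and the function name) is an assumption chosen for illustration and is not the architecture used in the experiments; in a deep learning framework the same effect would typically be obtained with a stop-gradient on the encoder output that is fed to the model.

```python
def shaped_update(w, q, a, s, s_next, target, lr):
    """One gradient step in which only the value loss updates the encoder.

    Encoder:    phi = w * s
    Value head: v = q * phi,              loss (v - target)^2
    Model head: phi_next_pred = a * phi,  loss (phi_next_pred - phi_next)^2
    """
    phi = w * s
    phi_next = w * s_next              # treated as a constant target for the model
    value_err = q * phi - target
    model_err = a * phi - phi_next

    grad_q = 2 * value_err * phi       # value loss -> value head
    grad_a = 2 * model_err * phi       # model loss -> model head
    grad_w = 2 * value_err * q * s     # value loss -> encoder
    # The gradient of the model loss with respect to w is deliberately not
    # computed or applied: the model is prevented from shaping the encoder.
    return w - lr * grad_w, q - lr * grad_q, a - lr * grad_a


# The value prediction is already correct here (2.0 * 1.0 == target), so the
# encoder must stay unchanged even though the model head still has an error.
new_w, new_q, new_a = shaped_update(w=1.0, q=2.0, a=0.5, s=1.0, s_next=3.0,
                                    target=2.0, lr=0.1)
```

In this example the model head is updated while the encoder parameter is left untouched, since the only gradient that reaches the encoder comes from the value loss.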

