THE BENEFITS OF MODEL-BASED GENERALIZATION IN REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.



Model-based reinforcement learning (RL) refers to the class of RL algorithms which learn a model of the world as an intermediate step to policy optimization. One important way such models can be used, which will be the focus of this work, is to generate imagined experience for training an agent's policy (Werbos, 1987; Munro, 1987; Jordan, 1988; Sutton, 1990; Schmidhuber, 1990). Experience Replay (ER) can be seen as a simple, nonparametric, model (Lin, 1992; van Hasselt et al., 2019) in which experienced interactions are directly stored, and later replayed, for use in a learning update. ER already captures many of the benefits associated with a learned model as compared to model-free incremental online algorithms (i.e. model-free algorithms which perform a learning update using each transition only at the time it is experienced). In particular, ER allows value to be rapidly propagated from states to their predecessors along previously observed transitions, without the need to actually revisit a particular transition for each step of value propagation. Propagating value only at the time a transition is visited can make model-free incremental online algorithms extremely wasteful of data, particularly in environments where the reward signal is sparse. As Lin (1992) and van Hasselt et al. (2019) have discussed, it is often not obvious why we would expect experience generated by a learned model to improve upon ER, as a replay buffer is essentially a perfect model of the world insofar as the agent has observed it. This is especially true in the tabular case, where a model does not generalize from the observed transitions. It is also true for policy evaluation in the case where the value function and model are linear (Sutton et al., 2012). In this case, learning the least-squares linear model from the data, and then finding the TD(0) solution (Sutton, 1988) in the resulting linear MDP, is identical to finding the TD(0) solution for the empirical MDP induced by the observed data.
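The value-propagation role of ER described above can be made concrete with a minimal sketch (illustrative Python; the names and the toy chain environment are ours, not the paper's): a buffer stores observed transitions, and Q-learning updates are applied to replayed samples, so value flows backward along stored edges without revisiting them in the environment.

```python
import random
from collections import defaultdict

class ReplayBuffer:
    """Nonparametric 'model': stores experienced transitions verbatim."""
    def __init__(self, capacity=10_000):
        self.buffer, self.capacity = [], capacity

    def add(self, transition):  # transition = (s, a, r, s_next, done)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def replay_q_updates(q, buffer, actions, alpha=0.5, gamma=0.9, batch_size=8):
    """One round of Q-learning updates on replayed transitions."""
    for s, a, r, s_next, done in buffer.sample(batch_size):
        target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])
    return q

# Usage: a 3-state chain (0 -> 1 -> terminal) with a sparse reward
# paid only on the final step.
q = defaultdict(float)
buffer = ReplayBuffer()
buffer.add((0, "right", 0.0, 1, False))
buffer.add((1, "right", 1.0, 2, True))
for _ in range(200):
    replay_q_updates(q, buffer, actions=["right"], batch_size=2)
# Value has propagated back through replay alone:
# q[(1, "right")] approaches 1.0 and q[(0, "right")] approaches 0.9.
```

An incremental online learner visiting each transition once would need to re-experience the first transition after the second to propagate any value to state 0; replay achieves this from stored data.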
Hence, if we expect to obtain a sample-efficiency benefit by using data generated by a learned model compared to ER, we should look beyond these cases. As a motivating example, imagine an agent that has at some point encountered a fallen tree lying across a stream, which allows it to cross in order to reach food on the other side. Imagine the same agent has previously pushed against a decaying tree and knocked it over, but in that case gained nothing by doing so. We may then expect that, upon seeing a decaying tree standing near a stream with food on the other side, the agent would be able to synthesize these two experiences to decide to intentionally push the tree over to provide a bridge over the stream. We argue that ER alone is unlikely to achieve the kind of generalization necessary to perform such extrapolation. To understand why, let us express the state as a combination of two abstract binary factors (tree fallen, food across stream). The agent has observed state (1,1), and has also observed separately that from state (0,0) it can achieve state (1,0) through a certain action (call that action push-tree). Now, the agent observes state (0,1). An agent using ER can easily learn that (1,1) is valuable, since it is followed by a food reward, but since it has never executed push-tree from (0,1) to reach that valuable state, there is no reason to expect it will assign that action an elevated value. On the other hand, a learned model that generalizes appropriately could guess, perhaps based on some inductive bias toward factored dynamics¹, that taking action push-tree in (0,1) would lead to (1,1) in the next step. Having done so, we could then use the model in various ways to incorporate this information into the value function and/or policy. One common way of using the model, which we will focus on in this work, is to generate (potentially multi-step) rollouts of simulated experience and train a value function or policy on the resulting trajectories as if they were real.
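One way such factored generalization might work can be sketched in a few lines (illustrative only; the class, the per-factor tables, and the STRIPS-style "action effects" assumption are ours, not the paper's algorithm): each state factor's dynamics are learned separately, with an action either leaving a factor unchanged or setting it to an observed value.

```python
class FactoredModel:
    """Toy per-factor dynamics model for the tree/stream example.

    Assumes (as an inductive bias) that each action either leaves a
    state factor unchanged or sets it to a fixed value, independently
    of the other factors.
    """
    def __init__(self, n_factors):
        self.n = n_factors
        # effects[i][action] is "identity" (factor unaffected) or
        # ("set", v): the action sets factor i to value v.
        self.effects = [dict() for _ in range(n_factors)]

    def observe(self, state, action, next_state):
        for i in range(self.n):
            if next_state[i] != state[i]:
                self.effects[i][action] = ("set", next_state[i])
            else:
                self.effects[i].setdefault(action, "identity")

    def predict(self, state, action):
        out = []
        for i in range(self.n):
            effect = self.effects[i].get(action, "identity")
            out.append(state[i] if effect == "identity" else effect[1])
        return tuple(out)

model = FactoredModel(n_factors=2)
# Observed once: push-tree in (tree=0, food=0) knocks the tree over.
model.observe((0, 0), "push-tree", (1, 0))
# Generalization: in the novel state (0, 1), the model predicts that
# push-tree leads to the valuable state (1, 1), a transition the
# agent has never actually experienced.
print(model.predict((0, 1), "push-tree"))  # -> (1, 1)
```

A replay buffer, by contrast, can only replay the transitions (0,0)→(1,0) and (1,1)→food that were literally observed; it has no mechanism for producing the imagined transition (0,1)→(1,1).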
Having discussed how learned model generalization can provide a benefit over ER, it's worth noting that parametric value functions also generalize. Why should model generalization be inherently better than value function generalization? This question was already raised in the work of Lin (1992), which first introduced ER for RL. The next section gives a partial answer as a theorem which shows how learning a model as an intermediate step can narrow the space of possible value functions more than learning a value function directly from the data using the Bellman equation. After motivating the benefit of model-based generalization theoretically, we will present an intuitive case where learning a parametric model is empirically beneficial with NN function approximation. Subsequently, we will present extensive experiments, in environments with factored structure, which highlight the sample-efficiency benefits of model-based learning for online RL. We will also analyze an interesting instance we came across during these experiments where an agent using a learned model outperforms one using the perfect model due to smoothed reward and transition dynamics. We use the MDP formalism throughout this paper. We refer the unfamiliar reader to Sutton & Barto (2020) for an accessible discussion of MDPs and their usage in RL.

1. UNDERSTANDING THE BENEFIT OF MODEL-BASED GENERALIZATION

We now present a simple theorem that provides at least part of the answer to the question of how model generalization can be considered more useful than value function generalization. We state the theorem informally here, and formally with proof in Appendix A. Intuitively, Theorem 1 states that if we want to narrow down the possible optimal action-value functions from data, we can in general prune more if we narrow down the possible models first than if we only demand that the value functions obey the Bellman optimality equation with respect to the observed data. However, part 3 states that this is not the case for a tabular model class, but rather only for restricted model classes, where knowledge of the class allows the model to generalize beyond the observed transitions.



¹ That is, the state consists of a set of state variables such that the distribution of each variable's value at the next time step depends on only a subset of the variables in the current state.



Theorem 1 (informal). Consider a class M of deterministic, episodic MDPs with a fixed reward function and transition functions belonging to some known hypothesis class. Let H_Q be the associated class of optimal action-value functions for MDPs in M, and consider a dataset D of transitions. Let H_B(D) be the subclass of action-value functions in H_Q which obey the Bellman optimality equation for the transitions in D, and let H_M(D) be the subclass of optimal action-value functions of MDPs in M which are consistent with D. Then the following are true:

1. H_M(D) ⊆ H_B(D).

2. For some choices of M and D, H_M(D) ⊂ H_B(D).

3. For a tabular transition function class, that is, one that includes every possible mapping from state-action pairs to next states, H_M(D) = H_B(D).
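The strict containment in part 2 can be checked numerically on a toy instance (constructed for this sketch; it is not the example from the paper's appendix). The hypothesis class contains two deterministic transition functions over states {0, 1, 'G'} ('G' is terminal, entering it pays reward 1, all other rewards are 0). A single observed transition rules out one model, yet both models' optimal value functions satisfy the Bellman optimality equation on the data, so H_M(D) is strictly smaller than H_B(D).

```python
# Toy check of H_M(D) ⊆ H_B(D), with strict containment for this D.
# Discounted value iteration is used; self-loops are tolerated since
# the discount keeps values finite.
GAMMA, STATES, ACTIONS = 0.9, [0, 1], ["a", "b"]

def reward(s_next):
    return 1.0 if s_next == "G" else 0.0

def optimal_q(trans):
    """Value iteration for a deterministic MDP with trans[(s, a)] = s'."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(200):
        for (s, a), s_next in trans.items():
            future = 0.0 if s_next == "G" else max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] = reward(s_next) + GAMMA * future
    return q

f_true = {(0, "a"): 1, (0, "b"): 1, (1, "a"): "G", (1, "b"): "G"}
f_alt  = {(0, "a"): 0, (0, "b"): "G", (1, "a"): "G", (1, "b"): "G"}
hypotheses = [f_true, f_alt]
D = [(0, "a", 1)]  # one observed transition: state 0, action a, next state 1

def consistent(trans):  # model agrees with every transition in D
    return all(trans[(s, a)] == s_next for s, a, s_next in D)

def bellman_ok(q):      # q obeys the Bellman optimality equation on D
    for s, a, s_next in D:
        future = 0.0 if s_next == "G" else max(q[(s_next, b)] for b in ACTIONS)
        if abs(q[(s, a)] - (reward(s_next) + GAMMA * future)) > 1e-6:
            return False
    return True

H_M = [optimal_q(f) for f in hypotheses if consistent(f)]
H_B = [q for q in (optimal_q(f) for f in hypotheses) if bellman_ok(q)]
print(len(H_M), len(H_B))  # -> 1 2
```

Here f_alt is inconsistent with D (it maps (0, a) to 0, not 1), so its optimal value function is excluded from H_M(D); but that value function happens to assign Q(0, a) = 0.9 = γ · max_b Q(1, b), so the Bellman check on D alone cannot reject it. Narrowing down the model first prunes strictly more.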

