THE BENEFITS OF MODEL-BASED GENERALIZATION IN REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.



1 Introduction

Model-based reinforcement learning (RL) refers to the class of RL algorithms which learn a model of the world as an intermediate step to policy optimization. One important way such models can be used, and the focus of this work, is to generate imagined experience for training an agent's policy (Werbos, 1987; Munro, 1987; Jordan, 1988; Sutton, 1990; Schmidhuber, 1990). Experience Replay (ER) can be seen as a simple, nonparametric, model (Lin, 1992; van Hasselt et al., 2019) in which experienced interactions are stored directly and later replayed for use in a learning update. ER already captures many of the benefits associated with a learned model as compared to model-free incremental online algorithms (i.e., model-free algorithms which perform a learning update using each transition only at the time it is experienced). In particular, ER allows value to be rapidly propagated from states to their predecessors along previously observed transitions, without the need to actually revisit a particular transition for each step of value propagation. Propagating value only at the time a transition is visited can make model-free incremental online algorithms extremely wasteful of data, particularly in environments where the reward signal is sparse. As Lin (1992) and van Hasselt et al. (2019) have discussed, it is often not obvious why we would expect experience generated by a learned model to improve upon ER, as a replay buffer is essentially a perfect model of the world insofar as the agent has observed it. This is especially true in the tabular case, where a model does not generalize from the observed transitions. It is also true for policy evaluation in the case where the value function and model are linear (Sutton et al., 2012).
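This equivalence in the linear case can be checked numerically. The following sketch (our illustration, not from the cited works; the features `Phi`, next-state features `PhiNext`, and rewards `r` are synthetic random data) fits a least-squares linear model of the dynamics and reward, solves for the TD(0) fixed point of that learned model, and verifies that it matches the LSTD(0) solution computed directly from the same data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic batch of n transitions in a k-dimensional linear feature space.
# Phi[i] = features of state s_i, PhiNext[i] = features of s'_i, r[i] = reward.
n, k, gamma = 50, 4, 0.9
Phi = rng.normal(size=(n, k))
PhiNext = 0.5 * rng.normal(size=(n, k))
r = rng.normal(size=n)

# LSTD(0): solve  Phi^T (Phi - gamma * PhiNext) theta = Phi^T r  directly from the data.
theta_lstd = np.linalg.solve(Phi.T @ (Phi - gamma * PhiNext), Phi.T @ r)

# Model-based route: fit least-squares linear models of the dynamics and reward,
# then solve for the TD(0) fixed point of the resulting linear MDP.
F, *_ = np.linalg.lstsq(Phi, PhiNext, rcond=None)  # expected next features: phi^T F
w, *_ = np.linalg.lstsq(Phi, r, rcond=None)        # expected reward: phi^T w
theta_model = np.linalg.solve(np.eye(k) - gamma * F, w)  # theta = w + gamma * F @ theta

# The two value-function solutions coincide.
assert np.allclose(theta_lstd, theta_model)
```

The assertion passes for any batch on which the relevant matrices are invertible, reflecting that, with linear function approximation, building the least-squares model adds no information beyond what LSTD already extracts from the replayed data.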
In this case, learning the least-squares linear model from the data and then finding the TD(0) solution (Sutton, 1988) in the resulting linear MDP is identical to finding the TD(0) solution for the empirical MDP induced by the observed data. Hence, if we expect to obtain a sample-efficiency benefit from data generated by a learned model compared to ER, we should look beyond these cases. Let us now consider a situation where a model-free agent using ER will likely fail to generalize from experience in a way that an intelligent agent should be able to. Imagine an agent has witnessed a

