DIMINISHING RETURN OF VALUE EXPANSION METHODS IN MODEL-BASED REINFORCEMENT LEARNING

Abstract

Model-based reinforcement learning is one approach to increase sample efficiency. However, the accuracy of the dynamics model and the resulting compounding error over modelled trajectories are commonly regarded as key limitations. A natural question to ask is: How much more sample efficiency can be gained by improving the learned dynamics models? Our paper empirically answers this question for the class of model-based value expansion methods in continuous control problems. Value expansion methods should benefit from increased model accuracy by enabling longer rollout horizons and better value function approximations. Our empirical study, which leverages oracle dynamics models to avoid compounding model errors, shows that (1) longer horizons increase sample efficiency, but the improvement diminishes with each additional expansion step, and (2) increased model accuracy only marginally improves sample efficiency compared to learned models with identical horizons. Therefore, longer horizons and increased model accuracy yield diminishing returns in terms of sample efficiency. These improvements in sample efficiency are particularly disappointing when compared to model-free value expansion methods: although these introduce no computational overhead, we find their performance to be on par with model-based value expansion methods. We therefore conclude that the limitation of model-based value expansion methods is not the accuracy of the learned models. While higher model accuracy is beneficial, our experiments show that even a perfect model does not provide unrivalled sample efficiency; the bottleneck lies elsewhere.

1. INTRODUCTION

Insufficient sample efficiency is a central issue that prevents reinforcement learning (RL) agents from learning in physical environments. Especially in applications like robotics, samples from the real system are particularly scarce and expensive to acquire due to the high cost of operating robots. A technique that has proven to substantially enhance sample efficiency is model-based reinforcement learning (Deisenroth et al., 2013). In model-based RL, a model of the system dynamics is usually learned from data and subsequently used for planning (Chua et al., 2018; Hafner et al., 2019) or for policy learning (Sutton, 1990; Janner et al., 2019). Over the years, the model-based RL community has identified several ways of applying (learned) dynamics models in the RL framework. Sutton's (1990) Dyna-Q framework uses the model for data augmentation, where model rollouts are started from real environment states drawn from a replay buffer. The collected data is then passed to a model-free RL agent. Recently, various improvements to the original Dyna-Q algorithm have been proposed, such as using ensembles of neural network models and short rollout horizons (Janner et al., 2019; Lai et al., 2020), or improving the synthetic data generation with model predictive control (Morgan et al., 2021).

Feinberg et al. (2018) proposed an alternative way of incorporating a dynamics model into the RL framework. Their model-based value expansion (MVE) algorithm unrolls the dynamics model and accumulates discounted rewards along the modelled trajectory to construct better targets for value function learning. Subsequent works were primarily concerned with adaptively setting the rollout horizon based on some (indirect) measure of the modelling error, e.g., model uncertainty (Buckman et al., 2018; Abbas et al., 2020), a reconstruction loss (Wang et al., 2020), or a local model error estimated via temporal difference learning (Xiao et al., 2019).
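The H-step MVE target can be sketched as follows. This is a minimal, deterministic sketch: the callables `model`, `reward_fn`, `policy`, and `q_fn` are hypothetical stand-ins for the learned components, and practical implementations operate on batches and use separate target networks.

```python
def mve_target(s, model, reward_fn, policy, q_fn, horizon, gamma=0.99):
    """H-step model-based value expansion (MVE) target in the spirit of
    Feinberg et al. (2018). All callables are illustrative placeholders:
    model(s, a) -> s', reward_fn(s, a) -> r, policy(s) -> a, q_fn(s, a) -> Q.
    """
    target, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        target += discount * reward_fn(s, a)  # accumulate discounted model rewards
        s = model(s, a)                       # bootstrap the next state from the model
        discount *= gamma
    # terminate the expansion with the learned Q-function
    target += discount * q_fn(s, policy(s))
    return target
```

For `horizon=0` this reduces to the standard one-step bootstrap target `Q(s, pi(s))`, so the expansion horizon smoothly interpolates between model-free and model-based targets.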
Lastly, using backpropagation through time (BPTT) (Werbos, 1990), dynamics models have been applied to the policy improvement step to compute better policy gradients. Deisenroth & Rasmussen (2011) use Gaussian process regression (Rasmussen & Williams, 2006) and moment matching to find a closed-form solution for the gradient of the trajectory loss function. Stochastic value gradients (SVG) (Heess et al., 2015) use a model in combination with the reparametrization trick (Kingma & Welling, 2014) to propagate gradients along real environment trajectories. Others leverage the model's differentiability to directly differentiate through model trajectories (Byravan et al., 2020; Amos et al., 2021). On a more abstract level, MVE- and SVG-type algorithms are very similar. Both learn a dynamics model and use it for H-step rollouts (where H is the number of modelled time steps) to better approximate the quantity of interest: the next-state Q-function in the case of SVG and the Q-function target in the case of MVE. These value expansion methods all assume that longer rollout horizons will improve learning, provided the model is sufficiently accurate. The common opinion is that learning models with smaller prediction errors may further improve the sample efficiency of RL. This argument is based on the fact that the single-step approximation error can become a substantial problem when using learned dynamics models for long trajectory rollouts: minor modelling errors accumulate quickly when multi-step trajectories are built by bootstrapping successive model predictions. This problem is known as the compounding model error. Furthermore, most bounds in model-based RL depend on the model error (Feinberg et al., 2018), and improvement guarantees assume model errors converging to zero. In practice, rollout horizons are therefore often kept short to avoid a significant build-up of compounding model error (Janner et al., 2019).
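The compounding of small single-step errors over bootstrapped rollouts can be illustrated with a toy linear system (our own illustrative example, not taken from the cited works): the model differs from the true dynamics only by a small perturbation, yet the multi-step prediction error grows with the horizon.

```python
import numpy as np

# True dynamics s' = A s; "learned" model s' = (A + E) s, where E is a
# small, fixed single-step modelling error.
A = np.array([[0.99, 0.05], [-0.05, 0.99]])  # stable true dynamics
E = np.array([[0.01, 0.00], [0.00, 0.01]])   # small single-step error

s_true = s_model = np.array([1.0, 0.0])
errors = []
for _ in range(20):
    s_true = A @ s_true                 # real trajectory
    s_model = (A + E) @ s_model         # bootstrapped model trajectory
    errors.append(np.linalg.norm(s_true - s_model))

# errors[0] equals the single-step error ||E s_0|| = 0.01, while errors[-1]
# is an order of magnitude larger: the error compounds along the rollout.
```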
Intuitively, longer model horizons have the potential to exploit the benefits of the model even more. For this reason, immense research efforts have been put into building and learning better dynamics models. We can differentiate between purely engineered (white-box) models and learned (black-box) models (Nguyen-Tuong & Peters, 2011). White-box models offer many advantages over black-box models, as they are more interpretable and, thus, more predictable. However, most real-world robotics problems are too complex to model analytically, and one has to resort to learning (black-box) dynamics models (Chua et al., 2018; Janner et al., 2020). Recently, authors have proposed neural network models that use physics-based inductive biases, also known as grey-box models (Lutter et al., 2019a;b; Greydanus et al., 2019). While research has focused on improving model quality, a question that has received little attention so far is: Is model-based reinforcement learning even limited by the model quality? For Dyna-style data augmentation algorithms, the answer is yes, as these methods treat model samples and real-environment samples alike. For value expansion methods, i.e., MVE- and SVG-type algorithms, the answer is unclear. Better models would enable longer horizons due to the reduced compounding model error and would improve the value function approximations. However, the impact of both on sample efficiency remains unclear. In this paper, we empirically address this question for value expansion methods. Using the true dynamics as an oracle model, we show that sample efficiency does not increase significantly even when a perfect model is used. Both increasing the rollout horizon with oracle dynamics and improving the value function approximation by replacing a learned model with an oracle at the same horizon yield diminishing returns in sample efficiency.
With the phrase diminishing returns, we refer to its definition in economics (Case & Fair, 1999). In the context of value expansion methods in model-based RL, we mean that the marginal utility of better models for sample efficiency decreases significantly as the models become more accurate. These gains in sample efficiency of model-based value expansion are especially disappointing compared to model-free value expansion methods, e.g., Retrace (Munos et al., 2016). While the model-free methods introduce no computational overhead, their performance is on par with that of model-based value expansion methods. When comparing the sample efficiency of different horizons for both model-based and model-free value expansion, one can clearly see that the improvement in sample efficiency diminishes with each additional step along a modelled trajectory. In some environments, the overall performance even decreases for longer horizons.
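As a point of reference, a model-free value expansion target such as Retrace is computed purely from recorded trajectories, with no model rollouts. Below is a simplified single-trajectory sketch of the Retrace(λ) target from Munos et al. (2016); the argument names are illustrative, and practical implementations vectorize this over batches.

```python
import numpy as np

def retrace_target(q, q_next_exp, rewards, ratios, gamma=0.99, lam=1.0):
    """Simplified Retrace(lambda) target for one off-policy trajectory.

    q          -- Q(s_t, a_t) along the trajectory, shape (T,)
    q_next_exp -- E_pi[Q(s_{t+1}, .)], shape (T,)
    rewards    -- r_t, shape (T,)
    ratios     -- importance ratios pi(a_t | s_t) / mu(a_t | s_t), shape (T,)
    """
    deltas = np.asarray(rewards) + gamma * np.asarray(q_next_exp) - np.asarray(q)
    c = lam * np.minimum(1.0, np.asarray(ratios))  # truncated importance weights
    target, coeff = q[0], 1.0
    for t in range(len(deltas)):
        if t > 0:
            coeff *= gamma * c[t]  # coeff = gamma^t * c_1 * ... * c_t
        target += coeff * deltas[t]
    return target
```

On-policy (all ratios equal to one) with `lam=1.0`, the TD errors telescope and the target recovers the plain n-step return; when a ratio drops to zero, the trace is cut and the target falls back to a shorter expansion, which is how Retrace safely handles off-policy data.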



Intelligent Autonomous Systems, Technical University of Darmstadt, daniel.palenicek@tu-darmstadt.de. Hessian.AI, Hochschulstr. 10, 64293 Darmstadt, Germany. German Research Center for AI (DFKI), Research Department: Systems AI for Robot Learning. Centre for Cognitive Science, Hochschulstr. 10, 64293 Darmstadt, Germany.

