DEPLOYMENT-EFFICIENT REINFORCEMENT LEARNING VIA MODEL-BASED OFFLINE OPTIMIZATION

Abstract

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), that not only performs better than or comparably to state-of-the-art dynamic-programming-based and concurrently-proposed model-based offline approaches on existing benchmarks, but can also effectively optimize a policy offline using 10-20 times less data than prior works. Furthermore, the recursive application of BREMEN achieves impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks (Barth-Maron et al., 2018; Hessel et al., 2018; Nachum et al., 2019). Virtually all of these demonstrations have relied on highly frequent online access to the environment, with the RL algorithms often interleaving each update to the policy with additional experience collection of that policy acting in the environment. However, in many real-world applications of RL, such as health (Murphy et al., 2001), education (Mandel et al., 2014), dialog agents (Jaques et al., 2019), and robotics (Gu et al., 2017a; Kalashnikov et al., 2018), the deployment of a new data-collection policy may be associated with a number of costs and risks. If we can learn tasks with a small number of data-collection policies, we can substantially reduce these costs and risks. Based on this idea, we propose a novel measure of RL algorithm performance, namely deployment efficiency, which counts the number of changes in the data-collection policy during learning, as illustrated in Figure 1. This concept may be seen in contrast to sample efficiency or data efficiency (Precup et al., 2001; Degris et al., 2012; Gu et al., 2017b; Haarnoja et al., 2018; Lillicrap et al., 2016; Nachum et al., 2018), which measures the amount of environment interaction incurred during training, without regard to how many distinct policies were deployed to perform those interactions. Even when the
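The distinction between the two metrics can be illustrated with a minimal sketch. All names below are hypothetical and are not from the paper; the sketch only shows that a deploy-collect-then-optimize-offline loop keeps the deployment count tiny while the sample count and the number of offline policy updates can remain large.

```python
# Illustrative sketch (hypothetical names): deployment efficiency counts
# distinct data-collection policies, sample efficiency counts environment
# transitions. Offline policy updates cost neither.

def run_deployment_efficient_loop(num_deployments=5,
                                  batch_size=200_000,
                                  offline_updates_per_batch=1_000):
    """Simulate a deploy -> collect batch -> optimize offline cycle."""
    deployments = 0       # deployment-efficiency metric
    env_steps = 0         # sample-efficiency metric
    policy_updates = 0    # offline gradient steps, no new deployment needed

    for _ in range(num_deployments):
        deployments += 1                              # deploy current policy once
        env_steps += batch_size                       # collect a static batch with it
        policy_updates += offline_updates_per_batch   # many offline updates per batch

    return deployments, env_steps, policy_updates

d, s, u = run_deployment_efficient_loop()
# 5 deployments still support 1,000,000 transitions and 5,000 offline updates.
```

The point of the sketch is that a standard online algorithm would increment the deployment counter at (nearly) every policy update, whereas the recursive offline scheme above changes the data-collection policy only once per batch.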


models are available at https://github.com/matsuolab/

