DEPLOYMENT-EFFICIENT REINFORCEMENT LEARNING VIA MODEL-BASED OFFLINE OPTIMIZATION

Abstract

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), that not only performs better than or comparably to the state-of-the-art dynamic-programming-based and concurrently proposed model-based offline approaches on existing benchmarks, but can also effectively optimize a policy offline using 10-20 times less data than prior works. Furthermore, the recursive application of BREMEN achieves impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks (Barth-Maron et al., 2018; Hessel et al., 2018; Nachum et al., 2019). Virtually all of these demonstrations have relied on highly frequent online access to the environment, with the RL algorithms often interleaving each update to the policy with additional experience collection of that policy acting in the environment. However, in many real-world applications of RL, such as health (Murphy et al., 2001), education (Mandel et al., 2014), dialog agents (Jaques et al., 2019), and robotics (Gu et al., 2017a; Kalashnikov et al., 2018), the deployment of a new data-collection policy may be associated with a number of costs and risks. If tasks can be learned with a small number of data-collection policies, these costs and risks can be substantially reduced. Based on this idea, we propose a novel measure of RL algorithm performance, namely deployment efficiency, which counts the number of changes in the data-collection policy during learning, as illustrated in Figure 1. This concept may be seen in contrast to sample efficiency or data efficiency (Precup et al., 2001; Degris et al., 2012; Gu et al., 2017b; Haarnoja et al., 2018; Lillicrap et al., 2016; Nachum et al., 2018), which measures the amount of environment interaction incurred during training, without regard to how many distinct policies were deployed to perform those interactions. Even when data efficiency is high, deployment efficiency can be low, since many on-policy and off-policy algorithms alternate data collection with each policy update (Schulman et al., 2015; Lillicrap et al., 2016; Gu et al., 2016; Haarnoja et al., 2018).
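The definition of deployment efficiency is easy to make concrete: it counts contiguous runs of a single data-collection policy, not the number of transitions collected. The short Python sketch below (the function name and batch sizes are our own illustrative choices, not from the paper) makes the contrast with per-step policy updates explicit:

```python
def num_deployments(collection_policy_ids):
    # Deployment efficiency counts distinct data-collection policies:
    # each contiguous run of one behavior policy is a single deployment,
    # regardless of how many transitions are gathered during it.
    count, prev = 0, None
    for pid in collection_policy_ids:
        if pid != prev:
            count += 1
            prev = pid
    return count

# An online off-policy learner that updates the policy after every
# transition effectively deploys one new policy per sample:
online_ids = list(range(1_000))
# A deployment-efficient scheme collects large batches under a few
# fixed policies (here, 5 batches of 200 transitions each):
batched_ids = [i // 200 for i in range(1_000)]

print(num_deployments(online_ids))   # 1000
print(num_deployments(batched_ids))  # 5
```

Both schedules use the same 1,000 environment interactions, so their sample efficiency is identical, yet their deployment counts differ by more than two orders of magnitude.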
Such dependence on high-frequency policy deployments is best illustrated by recent work in offline RL (Fujimoto et al., 2019; Jaques et al., 2019; Kumar et al., 2019; Levine et al., 2020; Wu et al., 2019), where baseline off-policy algorithms exhibited poor performance when trained on a static dataset. These offline RL works, however, limit their study to a single deployment, which is enough for achieving high performance with data collected from a sub-optimal behavior policy, but often not from a random policy. In contrast to those prior works, we aim to learn successful policies from scratch in a manner that is both sample- and deployment-efficient. Many existing model-free offline RL algorithms (Levine et al., 2020) are tuned and evaluated on massive datasets (e.g., one million transitions). In order to develop an algorithm that is both sample- and deployment-efficient, each iteration of the algorithm between successive deployments has to work effectively on much smaller datasets. We believe model-based RL is better suited to this setting due to its higher demonstrated sample efficiency compared with model-free RL (Kurutach et al., 2018; Nagabandi et al., 2018). Although the combination of model-based RL with offline or limited-deployment settings may seem straightforward, we find that this naïve approach leads to poor performance. This problem can be attributed to extrapolation errors (Fujimoto et al., 2019) similar to those observed in model-free methods. Specifically, the learned policy may choose sequences of actions that lead it to regions of the state space where the dynamics model cannot predict properly, due to poor coverage of the dataset. This can lead the policy to exploit approximation errors of the dynamics model and be disastrous for learning.
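The extrapolation problem can be seen in a minimal numpy sketch (the sinusoidal "true" dynamics and the linear model below are invented purely for illustration): a dynamics model fit on a narrow slice of the state space looks accurate in-support, yet is arbitrarily wrong outside it, and that out-of-support error is precisely what a model-exploiting policy can chase.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical nonlinear "true" dynamics, s' = f(s) = sin(s).
true_f = np.sin

# The offline dataset covers only a narrow slice of the state space.
s_data = rng.uniform(-0.5, 0.5, size=500)
sp_data = true_f(s_data)

# A linear dynamics model s' ≈ w*s fit by least squares looks
# excellent on the region it was trained on...
w = np.sum(s_data * sp_data) / np.sum(s_data**2)
in_support_error = abs(w * 0.4 - true_f(0.4))

# ...but is wildly wrong far outside the dataset's support, which is
# exactly the error a policy can learn to exploit in imagined rollouts.
out_of_support_error = abs(w * 3.0 - true_f(3.0))

print(f"in-support: {in_support_error:.3f}, "
      f"out-of-support: {out_of_support_error:.3f}")
```

On this toy problem the in-support error is on the order of 10^-3 while the out-of-support error is larger than 1, even though the model was fit perfectly well for its training distribution.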
In model-free settings, similar data distribution shift problems are typically remedied by explicitly regularizing policy updates with a divergence from the observed data distribution (Jaques et al., 2019; Kumar et al., 2019; Wu et al., 2019), which, however, can overly limit the expressivity of learned policies (Sohn et al., 2020). To better address these problems arising in limited-deployment settings, we propose Behavior-Regularized Model-ENsemble (BREMEN), which learns an ensemble of dynamics models in conjunction with a policy using imaginary rollouts, while implicitly regularizing the learned policy via appropriate parameter initialization and conservative trust-region learning updates. We evaluate BREMEN on standard offline RL benchmarks of high-dimensional continuous control tasks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with the state of the art when using standard dataset sizes, but can also learn with 10-20 times smaller datasets, which previous methods are unable to attain. Enabled by such stable and sample-efficient offline learning, we show that BREMEN can learn successful policies with only 5-10 deployments in the online setting, significantly outperforming existing off-policy and offline RL algorithms in deployment efficiency while maintaining sample efficiency.
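At a high level, the recursive procedure alternates a small number of deployments with offline, model-based policy improvement. The toy numpy sketch below mirrors that structure on a 1-D linear system; everything concrete in it (the scalar policy gain, the least-squares bootstrap ensemble, the behavior-cloned re-initialization, and a clipped gradient step standing in for the exact trust-region update) is a deliberate simplification for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    # Hypothetical "real" environment: s' = 0.9*s + a + small noise.
    return 0.9 * s + a + 0.01 * rng.standard_normal()

def deploy_and_collect(k, n_steps=200):
    # One deployment: run the stochastic policy a = -k*s + noise.
    data, s = [], 1.0
    for _ in range(n_steps):
        a = -k * s + 0.1 * rng.standard_normal()
        sp = true_step(s, a)
        data.append((s, a, sp))
        s = sp
    return data

def fit_model(data):
    # One ensemble member: least-squares fit of s' ≈ w0*s + w1*a
    # on a bootstrap resample of the dataset.
    idx = rng.integers(len(data), size=len(data))
    X = np.array([[s, a] for s, a, _ in data])[idx]
    y = np.array([sp for _, _, sp in data])[idx]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def bremen_loop(n_deployments=5, ensemble_size=4, n_offline_iters=20):
    k, dataset = 0.0, []                    # start from a near-random policy
    for _ in range(n_deployments):
        dataset += deploy_and_collect(k)    # one deployment per outer iter
        models = [fit_model(dataset) for _ in range(ensemble_size)]
        S = np.array([s for s, _, _ in dataset])
        A = np.array([a for _, a, _ in dataset])
        k = -(S @ A) / (S @ S)              # behavior-cloned re-initialization
        for _ in range(n_offline_iters):
            w = models[rng.integers(ensemble_size)]   # random ensemble member
            s0 = S[rng.integers(len(S), size=32)]     # imagined rollout starts
            sp = (w[0] - w[1] * k) * s0               # one-step imagined states
            # Gradient of the imagined cost E[(s')^2] w.r.t. k (normalized);
            # the clip plays the role of a conservative trust-region step.
            grad = np.mean(-2.0 * w[1] * s0 * sp) / (np.mean(s0**2) + 1e-8)
            k -= np.clip(0.1 * grad, -0.05, 0.05)
    return k
```

On this toy system the learned gain approaches the optimum k ≈ 0.9 (which drives s' toward zero under the true dynamics) within a handful of deployments, with all policy improvement between deployments happening inside the learned models.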



Figure 1: Deployment efficiency is defined as the number of changes in the data-collection policy (I), which is vital for managing the costs and risks of new policy deployment. Online RL algorithms typically require many iterations of policy deployment and data collection, which leads to extremely low deployment efficiency. In contrast, most pure offline algorithms consider updating a policy from a fixed dataset without additional deployment and often fail to learn from a randomly initialized data-collection policy. Interestingly, most state-of-the-art off-policy algorithms are still evaluated in heavily online settings. For example, SAC (Haarnoja et al., 2018) collects one sample per policy update, amounting to 100,000 to 1 million deployments for learning standard benchmark domains.

Availability

Trained models are available at https://github.com/matsuolab/BREMEN.

