EFFICIENT OFFLINE POLICY OPTIMIZATION WITH A LEARNED MODEL

Abstract

MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. To perform well, MCTS requires an accurate learned model and a large number of simulations, thus incurring a huge computational cost. This paper investigates a few hypotheses about where MuZero Unplugged may not work well under offline RL settings, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) learning with models improperly parameterized for the offline data; and 4) learning under a low compute budget. We propose a regularized one-step look-ahead approach to tackle the above issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimate based on a one-step rollout. The policy is improved in the direction that maximizes the estimated advantage, with regularization toward the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it uses only 5.6% of the wall-clock time (i.e., 1 hour) of MuZero Unplugged (i.e., 17.8 hours) to achieve a 150% IQM normalized score with the same hardware and software stacks. Our implementation is open-sourced at https://github.com/sail-sg/rosmo.

1. INTRODUCTION

Offline Reinforcement Learning (offline RL) (Levine et al., 2020) aims to learn highly rewarding policies exclusively from collected static experience, without requiring the agent to interact with the environment, which may be costly or even unsafe. It significantly enlarges the application potential of reinforcement learning, especially in domains like robotics and health care (Haarnoja et al., 2018; Gottesman et al., 2019), but is very challenging. By relying only on static datasets for value or policy learning, the agent in offline RL is prone to action-value over-estimation and improper extrapolation in out-of-distribution (OOD) regions. Previous works (Kumar et al., 2020; Wang et al., 2020; Siegel et al., 2020) address these issues by imposing specific value penalties or policy constraints, achieving encouraging results.

Model-based reinforcement learning (MBRL) approaches have also demonstrated effectiveness on offline RL problems (Kidambi et al., 2020; Yu et al., 2020; Schrittwieser et al., 2021). By modeling the dynamics and planning, MBRL extracts as much as possible from the data and is generally more data-efficient than model-free methods. We are especially interested in the state-of-the-art MBRL algorithm for offline RL, i.e., MuZero Unplugged (Schrittwieser et al., 2021), a simple extension of its online RL predecessor MuZero (Schrittwieser et al., 2020). MuZero Unplugged learns the dynamics and conducts Monte-Carlo Tree Search (MCTS) (Coulom, 2006; Kocsis & Szepesvári, 2006) planning with the learned model to improve the value and policy in a fully offline setting. In this work, we first scrutinize MuZero Unplugged by empirically validating hypotheses about when and how the algorithm could fail in offline RL settings.
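To make the one-step look-ahead idea concrete, the following is a minimal, illustrative sketch (not the paper's implementation; the toy `value` and `dynamics` functions and the temperature are assumptions). It estimates the advantage of each candidate action from a single model rollout, A(s, a) ≈ r + γV(s') − V(s), and then forms exponentiated-advantage weights that pull the policy toward high-advantage actions while staying anchored to the dataset:

```python
import numpy as np

# Hypothetical learned-model components (names illustrative only):
# value(s)       -> scalar value estimate of state s
# dynamics(s, a) -> (predicted reward r, predicted next state s')
def value(s):
    return float(np.tanh(s.sum()))

def dynamics(s, a):
    s_next = s + 0.1 * a   # toy latent transition
    r = 0.05 * a           # toy reward head
    return r, s_next

def one_step_advantage(s, actions, gamma=0.997):
    """Estimate A(s, a) = r + gamma * V(s') - V(s) from a single
    one-step model rollout, instead of a deep MCTS search."""
    v_s = value(s)
    return np.array([r + gamma * value(s_next) - v_s
                     for r, s_next in (dynamics(s, a) for a in actions)])

s = np.zeros(4)
adv = one_step_advantage(s, actions=np.array([-1.0, 0.0, 1.0]))
# Regularized improvement: weight dataset actions by exponentiated
# advantage (advantage-weighted-regression style), so the policy moves
# toward high-advantage actions without leaving the data distribution.
weights = np.exp(adv / 0.1)
weights /= weights.sum()
```

Because each advantage estimate needs only one model call per action, this costs a small, fixed number of model evaluations per state, in contrast to the hundreds of simulations a full MCTS typically performs.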

