VARIATIONAL LATENT BRANCHING MODEL FOR OFF-POLICY EVALUATION

Abstract

Model-based methods have recently shown great potential for off-policy evaluation (OPE): offline trajectories induced by behavioral policies are used to fit the transition functions of Markov decision processes (MDPs), which are then used to roll out simulated trajectories and estimate the performance of target policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover only a limited region of the state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM), which learns the transition function of MDPs by formulating the environmental dynamics as a compact latent space from which the next states and rewards are sampled. Specifically, VLBM leverages and extends the variational inference framework with recurrent state alignment (RSA), which is designed to capture as much information as possible from the limited training data by smoothing the information flow between the variational (encoding) and generative (decoding) parts of VLBM. Moreover, we introduce a branching architecture to improve the model's robustness against randomly initialized weights. We evaluate VLBM on the deep OPE (DOPE) benchmark, whose training trajectories are designed to yield varied coverage of the state-action space, and show that VLBM generally outperforms existing state-of-the-art OPE methods.

1. INTRODUCTION

Off-policy evaluation (OPE) allows reinforcement learning (RL) policies to be evaluated without online interactions. It is applicable in many domains where on-policy data collection is precluded by efficiency or safety concerns, e.g., healthcare (Gao et al., 2022c;a; Tang & Wiens, 2021), recommendation systems (Mehrotra et al., 2018; Li et al., 2011), education (Mandel et al., 2014), social science (Segal et al., 2018) and optimal control (Silver et al., 2016; Vinyals et al., 2019; Gao et al., 2020a; 2019; 2020b). Recently, as reported in the deep OPE (DOPE) benchmark (Fu et al., 2020b), model-based OPE methods, leveraging feed-forward (Fu et al., 2020b) and auto-regressive (AR) (Zhang et al., 2020a) architectures, have shown promising results for estimating the returns of target policies by fitting the transition functions of MDPs. However, model-based OPE methods remain challenged, as they can only be trained on offline trajectory data, which often provides limited coverage of the state and action space; they may therefore perform sub-optimally on tasks where parts of the dynamics are not fully explored (Fu et al., 2020b). Moreover, different initializations of the model weights can lead to varied evaluation performance (Hanin & Rolnick, 2018; Rossi et al., 2019), reducing the robustness of downstream OPE estimates. Some approaches in the RL policy-optimization literature use latent models trained to capture a compact space from which the dynamics underlying the MDP are extrapolated, allowing expressive representations to be learned over the state-action space. However, such approaches usually require online data collection, as their focus is on quickly navigating to high-reward regions (Rybkin et al., 2021), improving coverage of the explored state and action space (Zhang et al., 2019; Hafner et al., 2019; 2020a), or improving sample efficiency (Lee et al., 2020).
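To make the model-based OPE setup concrete, the following is a minimal sketch (not the VLBM itself, and much simpler than the neural architectures cited above): a transition and reward model is fit by least squares to offline trajectories from a hypothetical 1-D toy environment, then rolled out under a target policy to average discounted returns. The toy dynamics, policies, and all names here are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D environment: s' = 0.9*s + a + noise, reward = -s^2.
def true_step(s, a):
    return 0.9 * s + a + 0.1 * rng.normal(), -(s ** 2)

# 1) Offline trajectories collected by a behavioral policy (here: random actions).
S, A, S_next, R = [], [], [], []
for _ in range(200):
    s = rng.normal()
    for _ in range(20):
        a = rng.uniform(-1.0, 1.0)        # behavioral policy
        s2, r = true_step(s, a)
        S.append(s); A.append(a); S_next.append(s2); R.append(r)
        s = s2
X = np.column_stack([S, A, np.ones(len(S))])

# 2) Fit a simple linear transition/reward model to the offline data
#    (a stand-in for the learned MDP model in model-based OPE).
w_dyn, *_ = np.linalg.lstsq(X, np.array(S_next), rcond=None)
w_rew, *_ = np.linalg.lstsq(X, np.array(R), rcond=None)

# 3) Roll out the *learned* model under the target policy and average
#    discounted returns -- this average is the OPE estimate.
def target_policy(s):
    return float(np.clip(-0.9 * s, -1.0, 1.0))   # drives the state toward 0

gamma, returns = 0.99, []
for _ in range(100):
    s, G, disc = rng.normal(), 0.0, 1.0
    for _ in range(20):
        a = target_policy(s)
        feats = np.array([s, a, 1.0])
        G += disc * float(feats @ w_rew)  # model-predicted reward
        s = float(feats @ w_dyn)          # model-predicted next state
        disc *= gamma
    returns.append(G)

ope_estimate = float(np.mean(returns))
print(ope_estimate)
```

The sketch also illustrates the first challenge discussed above: because the behavioral policy visits only part of the state space, the fitted model (here, a deliberately under-expressive linear one) is biased wherever the offline data is sparse, and that bias propagates directly into the return estimate.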
In this work, we propose the variational latent branching model (VLBM), aiming to learn a compact and disentangled latent representation space from offline trajectories, which can better capture the

