VARIATIONAL LATENT BRANCHING MODEL FOR OFF-POLICY EVALUATION

Abstract

Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to roll out simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much information underlying the limited training data as possible, by smoothing out the information flow between the variational (encoding) and generative (decoding) parts of VLBM. Moreover, we also introduce the branching architecture to improve the model's robustness against randomly initialized model weights. The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, in which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM outperforms existing state-of-the-art OPE methods in general.

1. INTRODUCTION

Off-policy evaluation (OPE) allows reinforcement learning (RL) policies to be evaluated without online interactions. It is applicable to many domains where on-policy data collection is precluded by efficiency or safety concerns, e.g., healthcare (Gao et al., 2022c;a; Tang & Wiens, 2021), recommendation systems (Mehrotra et al., 2018; Li et al., 2011), education (Mandel et al., 2014), social science (Segal et al., 2018), and optimal control (Silver et al., 2016; Vinyals et al., 2019; Gao et al., 2020a; 2019; 2020b). Recently, as reported in the deep OPE (DOPE) benchmark (Fu et al., 2020b), model-based OPE methods leveraging feed-forward (Fu et al., 2020b) and auto-regressive (AR) (Zhang et al., 2020a) architectures have shown promising results for estimating the returns of target policies by fitting the transition functions of MDPs. However, model-based OPE methods remain challenged, as they can only be trained on offline trajectory data, which often offers limited coverage of the state and action space. Thus, they may perform sub-optimally on tasks where parts of the dynamics are not fully explored (Fu et al., 2020b). Moreover, different initializations of the model weights can lead to varied evaluation performance (Hanin & Rolnick, 2018; Rossi et al., 2019), reducing the robustness of downstream OPE estimates. Some approaches in the RL policy optimization literature use latent models trained to capture a compact space from which the dynamics underlying MDPs are extrapolated; this allows learning expressive representations over the state-action space. However, such approaches usually require online data collection, as their focus is on quickly navigating to high-reward regions (Rybkin et al., 2021), on improving coverage of the explored state and action space (Zhang et al., 2019; Hafner et al., 2019; 2020a), or on sample efficiency (Lee et al., 2020).
In this work, we propose the variational latent branching model (VLBM), which aims to learn a compact and disentangled latent representation space from offline trajectories that better captures the dynamics of the underlying environment. VLBM enriches the architectures and optimization objectives of existing latent modeling frameworks, allowing them to learn from a fixed set of offline trajectories. Specifically, VLBM learns variational (encoding) and generative (decoding) distributions, both represented by long short-term memories (LSTMs) with reparameterization (Kingma & Welling, 2013), to encode the state-action pairs and enforce the transitions over the latent space, respectively. To train such models, we optimize the evidence lower bound (ELBO) jointly with a recurrent state alignment (RSA) term defined over the LSTM hidden states; this ensures that the information encoded into the latent space can be effectively teased out by the decoder. We then introduce the branching architecture, which allows multiple decoders to jointly infer from the latent space and reach a consensus, from which the next state and reward are generated. This is designed to mitigate a known drawback of model-based methods, namely that different weight initializations can lead to varied performance (Fu et al., 2020b; Hanin & Rolnick, 2018; Rossi et al., 2019). We focus on using the VLBM to facilitate OPE, since OPE makes it easier to isolate the improvements gained from learning the dynamics underlying the MDP used for estimating policy returns; in RL training, by contrast, performance can be affected by multiple factors, e.g., the techniques used for exploration and policy optimization. Moreover, model-based OPE methods are helpful for evaluating the safety and efficacy of RL-based controllers before deployment in the real world (Gao et al., 2022b), e.g., predicting how a surgical robot would react to states that are critical to a successful procedure.
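As a rough illustration of two of the ingredients above, the sketch below shows (i) the reparameterization trick used to sample a latent variable from a Gaussian, and (ii) a branching-style consensus in which several independently initialized decoders average their next-state predictions. All dimensions, the linear decoders, and the equal branch weights are toy assumptions made here for illustration; the model described in this paper uses LSTM-based encoders and decoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps, with eps ~ N(0, I) (Kingma & Welling, 2013)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def branching_decode(z, decoders, weights):
    # Each branch decodes the latent z into a next-state prediction;
    # the branches reach a consensus via a normalized weighted average.
    preds = np.stack([d(z) for d in decoders])  # (num_branches, state_dim)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ preds                            # (state_dim,)

# Toy branches: linear decoders with different random initializations,
# illustrating how the consensus smooths out initialization variance.
latent_dim, state_dim, num_branches = 4, 3, 3
decoders = [
    (lambda W: (lambda z: W @ z))(rng.standard_normal((state_dim, latent_dim)))
    for _ in range(num_branches)
]
z = reparameterize(np.zeros(latent_dim), np.zeros(latent_dim), rng)
s_next = branching_decode(z, decoders, weights=[1.0] * num_branches)
```

With equal weights, the consensus reduces to the mean of the branch predictions; unequal weights would let some branches contribute more than others.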
The key contributions of this paper are summarized as follows: (i) to the best of our knowledge, the VLBM is the first method that leverages variational inference for OPE; it can be trained from offline trajectories to capture environment dynamics over a latent space and to accurately estimate the returns of target (evaluation) policies. (ii) The RSA loss term and the branching architecture effectively smooth the information flow in the latent space shared by the encoder and decoder, increasing the expressiveness and robustness of the model; this is shown empirically through comparisons against ablation baselines. (iii) Our method generally outperforms existing model-based and model-free OPE methods when evaluating policies over various D4RL environments (Fu et al., 2020a). Specifically, we follow the guidelines provided by the DOPE benchmark (Fu et al., 2020b), which contains challenging OPE tasks where the training trajectories provide varying levels of coverage of the state-action space, and the target policies are designed to induce state-action distributions different from those of the behavioral policies.

2. VARIATIONAL LATENT BRANCHING MODEL

In this section, we first introduce the objective of OPE and the variational latent model (VLM) we consider. Then, we propose the recurrent state alignment (RSA) term as well as the branching architecture that constitute the variational latent branching model (VLBM).

2.1. OPE OBJECTIVE

We first introduce the MDP used to characterize the environment. Specifically, an MDP can be defined as a tuple M = (S, A, P, R, s_0, γ), where S is the set of states, A the set of actions, P : S × A → S is the transition distribution, usually captured by the probabilities p(s_t | s_{t-1}, a_{t-1}), R : S × A → ℝ is the reward function, s_0 is the initial state sampled from the initial state distribution p(s_0), and γ ∈ [0, 1) is the discounting factor. Finally, the agent interacts with the MDP following some policy π(a|s), which defines the probability of taking action a at state s. The goal of OPE can then be formulated as follows. Given trajectories collected by a behavioral policy β, ρ^β = {[(s_0, a_0, r_0, s_1), . . . , (s_{T-1}, a_{T-1}, r_{T-1}, s_T)]^(0), [(s_0, a_0, r_0, s_1), . . . ]^(1), . . . | a_t ∼ β(a_t | s_t)}¹, estimate the expected total return over the unknown state-action visitation distribution ρ^π of the target (evaluation) policy π, i.e., for T being the horizon,

E_{(s,a)∼ρ^π, r∼R} [ Σ_{t=0}^{T} γ^t R(s_t, a_t) ].    (1)
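Once a transition model has been fitted, the expectation in Eq. (1) is typically approximated by Monte Carlo rollouts of the target policy through the learned model. The sketch below illustrates this estimator; `model_step`, `reward_fn`, `policy`, and `s0_sampler` are hypothetical callables standing in for the fitted MDP components and the target policy, not part of this paper's implementation.

```python
import numpy as np

def estimate_return(model_step, reward_fn, policy, s0_sampler,
                    gamma=0.99, horizon=100, n_rollouts=256, rng=None):
    """Monte Carlo estimate of Eq. (1): the average discounted return of
    the target policy under a learned transition model.

    model_step(s, a, rng) -> next state, reward_fn(s, a) -> scalar reward,
    policy(s, rng) -> action, s0_sampler(rng) -> initial state.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(n_rollouts):
        s = s0_sampler(rng)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s, rng)
            ret += disc * reward_fn(s, a)  # accumulate gamma^t * R(s_t, a_t)
            disc *= gamma
            s = model_step(s, a, rng)      # simulated transition
        total += ret
    return total / n_rollouts

# Toy sanity check: a constant reward of 1 gives the geometric sum
# (1 - gamma^T) / (1 - gamma), independent of states and actions.
est = estimate_return(model_step=lambda s, a, rng: s,
                      reward_fn=lambda s, a: 1.0,
                      policy=lambda s, rng: 0,
                      s0_sampler=lambda rng: 0.0,
                      gamma=0.9, horizon=5, n_rollouts=4)
```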

2.2. VARIATIONAL LATENT MODEL

We consider the VLM consisting of a prior p(z) over the latent variables z ∈ Z ⊂ ℝ^l, with Z representing the latent space and l its dimension, along with a variational encoder q_ψ(z_t | z_{t-1}, a_{t-1}, s_t)
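A minimal, illustrative stand-in for an encoder of the form q_ψ(z_t | z_{t-1}, a_{t-1}, s_t) is sketched below. For brevity it replaces the LSTM with a single tanh recurrence, and all weights and dimensions are toy assumptions rather than the paper's architecture; the point is only to show the recurrent inputs and the Gaussian reparameterized output.

```python
import numpy as np

class ToyVariationalEncoder:
    """Toy stand-in for q_psi(z_t | z_{t-1}, a_{t-1}, s_t): one tanh
    recurrence over the concatenated inputs, followed by Gaussian heads
    producing (mu, logvar), from which z_t is sampled by reparameterization."""

    def __init__(self, latent_dim, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = latent_dim + action_dim + state_dim
        # Small random weights; a learned LSTM would replace these.
        self.W_h = rng.standard_normal((latent_dim, in_dim)) * 0.1
        self.W_mu = rng.standard_normal((latent_dim, latent_dim)) * 0.1
        self.W_lv = rng.standard_normal((latent_dim, latent_dim)) * 0.1

    def step(self, z_prev, a_prev, s_t, rng):
        x = np.concatenate([z_prev, a_prev, s_t])
        h = np.tanh(self.W_h @ x)               # recurrent feature
        mu, logvar = self.W_mu @ h, self.W_lv @ h
        eps = rng.standard_normal(mu.shape)     # reparameterization trick
        z_t = mu + np.exp(0.5 * logvar) * eps
        return z_t, mu, logvar

enc = ToyVariationalEncoder(latent_dim=4, state_dim=3, action_dim=2)
z_t, mu, logvar = enc.step(np.zeros(4), np.zeros(2), np.ones(3),
                           np.random.default_rng(1))
```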



¹ We slightly abuse the notation ρ^β to represent either the trajectories or the state-action visitation distribution under the behavioral policy, depending on the context.

