VARIATIONAL LATENT BRANCHING MODEL FOR OFF-POLICY EVALUATION

Abstract

Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to roll out simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much information underlying the limited training data as possible, by smoothing out the information flow between the variational (encoding) and generative (decoding) parts of VLBM. Moreover, we also introduce the branching architecture to improve the model's robustness against randomly initialized model weights. The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, in which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM generally outperforms existing state-of-the-art OPE methods.

1. INTRODUCTION

Off-policy evaluation (OPE) allows for the evaluation of reinforcement learning (RL) policies without online interactions. It is applicable to many domains where on-policy data collection could be prevented due to efficiency and safety concerns, e.g., healthcare (Gao et al., 2022c; a; Tang & Wiens, 2021), recommendation systems (Mehrotra et al., 2018; Li et al., 2011), education (Mandel et al., 2014), social science (Segal et al., 2018) and optimal control (Silver et al., 2016; Vinyals et al., 2019; Gao et al., 2020a; 2019; 2020b). Recently, as reported in the deep OPE (DOPE) benchmark (Fu et al., 2020b), model-based OPE methods, leveraging feed-forward (Fu et al., 2020b) and auto-regressive (AR) (Zhang et al., 2020a) architectures, have shown promising results toward estimating the return of target policies, by fitting the transition functions of MDPs. However, model-based OPE methods remain challenged as they can only be trained using offline trajectory data, which often offers limited coverage of the state and action space. Thus, they may perform sub-optimally on tasks where parts of the dynamics are not fully explored (Fu et al., 2020b). Moreover, different initializations of the model weights could lead to varied evaluation performance (Hanin & Rolnick, 2018; Rossi et al., 2019), reducing the robustness of downstream OPE estimations. Some approaches in the RL policy optimization literature use latent models trained to capture a compact space from which the dynamics underlying MDPs are extrapolated; this allows learning expressive representations over the state-action space. However, such approaches usually require online data collection, as the focus is on quickly navigating to the high-reward regions (Rybkin et al., 2021), as well as on improving coverage of the explored state and action space (Zhang et al., 2019; Hafner et al., 2019; 2020a) or sample efficiency (Lee et al., 2020).
In this work, we propose the variational latent branching model (VLBM), aiming to learn a compact and disentangled latent representation space from offline trajectories, which can better capture the dynamics underlying environments. VLBM enriches the architectures and optimization objectives of existing latent modeling frameworks, allowing them to learn from a fixed set of offline trajectories. Specifically, VLBM considers learning variational (encoding) and generative (decoding) distributions, both represented by long short-term memories (LSTMs) with reparameterization (Kingma & Welling, 2013), to encode the state-action pairs and enforce the transitions over the latent space, respectively. To train such models, we optimize over the evidence lower bound (ELBO) jointly with a recurrent state alignment (RSA) term defined over the LSTM states; this ensures that the information encoded into the latent space can be effectively teased out by the decoder. Then, we introduce the branching architecture that allows multiple decoders to jointly infer from the latent space and reach a consensus, from which the next state and reward are generated. This is designed to mitigate the side effects of model-based methods where different weight initializations could lead to varied performance (Fu et al., 2020b; Hanin & Rolnick, 2018; Rossi et al., 2019). We focus on using the VLBM to facilitate OPE since this allows us to better isolate the improvements made upon learning the dynamics underlying the MDP used for estimating policy returns, as opposed to RL training, where performance can be affected by multiple factors, e.g., techniques used for exploration and policy optimization. Moreover, model-based OPE methods are helpful for evaluating the safety and efficacy of RL-based controllers before deployment in the real world (Gao et al., 2022b), e.g., how a surgical robot would react to states that are critical to a successful procedure.
The key contributions of this paper are summarized as follows: (i) to the best of our knowledge, the VLBM is the first method that leverages variational inference for OPE. It can be trained using offline trajectories and capture environment dynamics over latent space, as well as estimate returns of target (evaluation) policies accurately. (ii) The design of the RSA loss term and branching architecture can effectively smooth the information flow in the latent space shared by the encoder and decoder, increasing the expressiveness and robustness of the model. This is empirically shown in experiments by comparing with ablation baselines. (iii) Our method generally outperforms existing model-based and model-free OPE methods, for evaluating policies over various D4RL environments (Fu et al., 2020a) . Specifically, we follow guidelines provided by the DOPE benchmark (Fu et al., 2020b) , which contains challenging OPE tasks where the training trajectories include varying levels of coverage of the state-action space, and target policies are designed toward resulting in state-action distributions different from the ones induced by behavioral policies.

2. VARIATIONAL LATENT BRANCHING MODEL

In this section, we first introduce the objective of OPE and the variational latent model (VLM) we consider. Then, we propose the recurrent state alignment (RSA) term as well as the branching architecture that constitute the variational latent branching model (VLBM).

2.1. OPE OBJECTIVE

We first introduce the MDP used to characterize the environment. Specifically, an MDP can be defined as a tuple M = (S, A, P, R, s_0, γ), where S is the set of states, A the set of actions, P the transition distribution, usually captured by probabilities p(s_t | s_{t-1}, a_{t-1}), R : S × A → R the reward function, s_0 the initial state sampled from the initial state distribution p(s_0), and γ ∈ [0, 1) the discounting factor. Finally, the agent interacts with the MDP following some policy π(a|s), which defines the probabilities of taking action a at state s. The goal of OPE can then be formulated as follows: given trajectories collected by a behavioral policy β,

ρ^β = { [(s_0, a_0, r_0, s_1), ..., (s_{T-1}, a_{T-1}, r_{T-1}, s_T)]^(0), [(s_0, a_0, r_0, s_1), ...]^(1), ... | a_t ∼ β(a_t | s_t) },

estimate the expected total return over the unknown state-action visitation distribution ρ^π of the target (evaluation) policy π, i.e., for T being the horizon,

E_{(s,a) ∼ ρ^π, r ∼ R} [ Σ_{t=0}^{T} γ^t R(s_t, a_t) ].  (1)
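As a concrete illustration of (1), once a model of the environment is available, the expected return can be approximated by averaging discounted returns over simulated rollouts. Below is a minimal sketch of such a Monte Carlo estimator; `step_fn`, `reset_fn`, and `policy` are hypothetical placeholders for a learned transition/reward model, an initial-state sampler, and the target policy, not components from the paper.

```python
import numpy as np

def estimate_return(step_fn, reset_fn, policy, gamma=0.99, horizon=100,
                    n_rollouts=50, seed=0):
    """Monte Carlo estimate of Eq. (1): average the discounted return of
    trajectories rolled out in a (learned) model of the environment."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        s = reset_fn(rng)                 # s_0 ~ p(s_0)
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s, rng)            # a_t ~ pi(a_t | s_t)
            s, r = step_fn(s, a, rng)     # s_{t+1}, r_t from the model
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```

For instance, with a toy model that always emits reward 1, the estimate reduces to the geometric series Σ_{t=0}^{H-1} γ^t over the horizon H.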

2.2. VARIATIONAL LATENT MODEL

We consider the VLM consisting of a prior p(z) over the latent variables z ∈ Z ⊂ R^l, with Z representing the latent space and l its dimension, along with a variational encoder q_ψ(z_t | z_{t-1}, a_{t-1}, s_t) and a generative decoder p_ϕ(z_t, s_t, r_{t-1} | z_{t-1}, a_{t-1}), parameterized by ψ and ϕ respectively. Basics of variational inference are introduced in Appendix F.

Latent Prior p(z_0). The prior specifies the distribution from which the latent variable of the initial stage, z_0, is sampled. We configure p(z_0) to follow a Gaussian with zero mean and identity covariance matrix, which is a common choice under the variational inference framework (Kingma & Welling, 2013; Lee et al., 2020).

Variational Encoder for Inference q_ψ(z_t | z_{t-1}, a_{t-1}, s_t). The encoder is used to approximate the intractable posterior p(z_t | z_{t-1}, a_{t-1}, s_t) = p(z_{t-1}, a_{t-1}, z_t, s_t) / ∫_{z_t ∈ Z} p(z_{t-1}, a_{t-1}, z_t, s_t) dz_t, where the denominator requires integrating over the unknown latent space. Specifically, the encoder can be decomposed into two parts, given that

q_ψ(z_{0:T} | s_{0:T}, a_{0:T-1}) = q_ψ(z_0 | s_0) ∏_{t=1}^{T} q_ψ(z_t | z_{t-1}, a_{t-1}, s_t);  (2)

here, q_ψ(z_0 | s_0) encodes the initial state s_0 into the corresponding latent variable z_0; then, q_ψ(z_t | z_{t-1}, a_{t-1}, s_t) enforces the transition from z_{t-1} to z_t conditioned on a_{t-1} and s_t. Both distributions are diagonal Gaussians, with means and diagonals of covariance matrices determined by a multi-layered perceptron (MLP) (Bishop, 2006) and a long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) respectively. The weights of both networks are referred to as ψ in general. Consequently, the inference process for z_t can be summarized as

z_0^ψ ∼ q_ψ(z_0 | s_0),  h_t^ψ = f_ψ(h_{t-1}^ψ, z_{t-1}^ψ, a_{t-1}, s_t),  z_t^ψ ∼ q_ψ(z_t | h_t^ψ),  (3)

where f_ψ represents the LSTM layer and h_t^ψ the LSTM recurrent (hidden) state. Note that we use ψ in superscripts to distinguish the variables involved in this inference process from those of the generative process introduced below.
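To make the inference process in (3) concrete, the following minimal numpy sketch runs one encoder step with reparameterized sampling. It is illustrative only: a plain tanh recurrence stands in for the LSTM cell f_ψ, single linear maps stand in for the MLPs producing the Gaussian parameters, and all weights and dimensions are toy values of our own choosing, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, L, H = 4, 2, 16, 32  # toy state, action, latent, and hidden dimensions

# Random toy weights standing in for the trained parameters psi.
W_h = rng.normal(scale=0.1, size=(H, H + L + A + S))
W_mu = rng.normal(scale=0.1, size=(L, H))
W_logvar = rng.normal(scale=0.1, size=(L, H))

def encoder_step(h_prev, z_prev, a_prev, s_t, rng):
    """One inference step: h_t = f_psi(h_{t-1}, z_{t-1}, a_{t-1}, s_t),
    then z_t ~ q_psi(z_t | h_t). A tanh recurrence replaces the LSTM cell;
    the reparameterization trick (mu + sigma * eps) makes the sample
    differentiable w.r.t. the Gaussian parameters."""
    x = np.concatenate([h_prev, z_prev, a_prev, s_t])
    h_t = np.tanh(W_h @ x)                   # simplified recurrent cell
    mu, logvar = W_mu @ h_t, W_logvar @ h_t  # diagonal Gaussian parameters
    eps = rng.normal(size=L)                 # eps ~ N(0, I)
    z_t = mu + np.exp(0.5 * logvar) * eps    # reparameterized sample
    return h_t, z_t
```

In a full implementation, `encoder_step` would be applied recursively over the trajectory, starting from z_0 drawn via q_ψ(z_0 | s_0).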
Moreover, reparameterization can be used to sample z_0^ψ and z_t^ψ, such that gradients of sampling can be back-propagated, as introduced in Kingma & Welling (2013). An overview of the inference and generative processes is illustrated in Fig. 1.

Generative Decoder for Sampling p_ϕ(z_t, s_t, r_{t-1} | z_{t-1}, a_{t-1}). The decoder is used to interact with the target policies and acts as a synthetic environment during policy evaluation, from which the expected returns can be estimated as the mean return of simulated trajectories. The decoder can be represented by the product of three diagonal Gaussian distributions, given that

p_ϕ(z_{1:T}, s_{0:T}, r_{0:T-1} | z_0, π) = ∏_{t=0}^{T} p_ϕ(s_t | z_t) ∏_{t=1}^{T} p_ϕ(z_t | z_{t-1}, a_{t-1}) p_ϕ(r_{t-1} | z_t),  (4)

with a_t ∼ π(a_t | s_t) at each time step. Specifically, p_ϕ(z_t | z_{t-1}, a_{t-1}) has its mean and covariance determined by an LSTM, enforcing the transition from z_{t-1} to z_t in the latent space given action a_{t-1}. In what follows, p_ϕ(s_t | z_t) and p_ϕ(r_{t-1} | z_t) generate the current state s_t and reward r_{t-1} given z_t, with means and covariances determined by MLPs. As a result, the generative process starts with sampling the initial latent variable from the latent prior, i.e., z_0^ϕ ∼ p(z_0). Then, the initial state s_0^ϕ ∼ p_ϕ(s_0 | z_0^ϕ) and action a_0 ∼ π(a_0 | s_0^ϕ) are obtained from p_ϕ and the target policy π, respectively; the rest of the generative process can be summarized as

h_t^ϕ = f_ϕ(h_{t-1}^ϕ, z_{t-1}^ϕ, a_{t-1}),  h̃_t^ϕ = g_ϕ(h_t^ϕ),  z_t^ϕ ∼ p_ϕ(z_t | h̃_t^ϕ),  s_t^ϕ ∼ p_ϕ(s_t | z_t^ϕ),  r_{t-1}^ϕ ∼ p_ϕ(r_{t-1} | z_t^ϕ),  a_t ∼ π(a_t | s_t^ϕ),  (5)

where f_ϕ is the LSTM layer producing the recurrent state h_t^ϕ, and g_ϕ is an MLP mapping h_t^ϕ to h̃_t^ϕ, which will be used for the recurrent state alignment (RSA) introduced below to augment the information flow between the inference and generative processes.

Furthermore, to train the elements in the encoder (3) and decoder (5), one can maximize the evidence lower bound (ELBO), a lower bound of the joint log-likelihood p(s_{0:T}, r_{0:T-1}), following

L_ELBO(ψ, ϕ) = E_{q_ψ} [ Σ_{t=0}^{T} log p_ϕ(s_t | z_t) + Σ_{t=1}^{T} log p_ϕ(r_{t-1} | z_t) − KL( q_ψ(z_0 | s_0) || p(z_0) ) − Σ_{t=1}^{T} KL( q_ψ(z_t | z_{t-1}, a_{t-1}, s_t) || p_ϕ(z_t | z_{t-1}, a_{t-1}) ) ];  (6)

here, the first two terms represent the log-likelihood of reconstructing the states and rewards, and the last two terms regularize the approximated posterior. The proof can be found in Appendix E.
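The ELBO (6) is built from two reusable pieces: diagonal-Gaussian log-likelihoods for the reconstruction terms, and closed-form KL divergences between diagonal Gaussians for the two regularization terms. A minimal numpy sketch of both pieces is given below; these are the standard formulas, not code from the paper.

```python
import numpy as np

def gaussian_log_prob(x, mu, logvar):
    """log N(x; mu, diag(exp(logvar))), summed over dimensions.
    Used for the reconstruction terms log p(s_t|z_t) and log p(r_{t-1}|z_t)."""
    return -0.5 * np.sum(logvar + np.log(2 * np.pi)
                         + (x - mu) ** 2 / np.exp(logvar))

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between diagonal Gaussians, as used in the
    last two terms of the ELBO (encoder posterior vs. prior/decoder)."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )
```

Summing the log-likelihood terms and subtracting the KL terms over a trajectory, under samples from the encoder, yields a Monte Carlo estimate of (6).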

2.3. RECURRENT STATE ALIGNMENT

The latent model discussed above is somewhat reminiscent of the ones used in model-based RL policy training methods, e.g., the recurrent state space model (RSSM) used in PlaNet (Hafner et al., 2019) and Dreamer (Hafner et al., 2020a; b), as well as similar ones in Lee et al. (2020); Lu et al. (2022). Such methods rely on a growing experience buffer for training, which is collected online by the target policy that is being concurrently updated (with exploration noise added); however, OPE aims to extrapolate returns from a fixed set of offline trajectories, which may result in limited coverage of the state and action space. Consequently, directly applying VLM for OPE can lead to subpar performance empirically; see results in Sec. 3. Moreover, the encoder above plays a key role in capturing the temporal transitions between latent variables, i.e., q_ψ(z_t | z_{t-1}, a_{t-1}, s_t) from (2). However, it is absent from the generative process, as the decoder leverages a separate network to determine the latent transitions, i.e., p_ϕ(z_t | z_{t-1}, a_{t-1}). Moreover, from the ELBO (6) above it can be seen that only the KL-divergence terms are used to regularize these two parts, which may not be sufficient for OPE as limited offline trajectories are provided. As a result, we introduce the RSA term as part of the training objective, to further regularize q_ψ(z_t | z_{t-1}, a_{t-1}, s_t) and p_ϕ(z_t | z_{t-1}, a_{t-1}). A graphical illustration of RSA can be found in Fig. 2. Specifically, RSA is defined as the mean pairwise squared error between h_t^ψ from the encoder (3) and h̃_t^ϕ from the decoder (5), i.e.,

L_RSA(h̃_t^ϕ, h_t^ψ; ψ, ϕ) = (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T} [2 / (M(M−1))] Σ_{j=1}^{M−1} Σ_{k=j+1}^{M} ( (h̃_t^ϕ[j] − h̃_t^ϕ[k]) − (h_t^ψ[j] − h_t^ψ[k]) )^2;  (7)

here, we assume that both LSTM recurrent states have the same dimension, h̃_t^ϕ, h_t^ψ ∈ R^M, with h_t^(·)[j] referring to the j-th element of the recurrent state, and N the number of training trajectories.
Here, we choose the pairwise squared loss over the classic mean squared error (MSE), because MSE could be too strong a regularizer for h_t^ψ and h̃_t^ϕ, which support the inference and generative processes respectively and are not supposed to be exactly the same. In contrast, the pairwise loss (7) can promote structural similarity between the LSTM recurrent states of the encoder and decoder, without strictly enforcing them to become the same. Note that this design choice has been justified in Sec. 3 through an ablation study comparing against models trained with MSE. In general, the pairwise loss has also been adopted in many domains for similar purposes, e.g., object detection (Gould et al., 2009; Rocco et al., 2018), ranking systems (Doughty et al., 2018; Saquil et al., 2021) and contrastive learning (Wang et al., 2021; Chen et al., 2020). Similarly, we apply the pairwise loss over h_t^ψ and h̃_t^ϕ, instead of directly over h_t^ψ and h_t^ϕ, as the mapping g_ϕ (from (5)) could serve as a regularization layer to ensure optimality over L_RSA without changing h_t^ψ, h_t^ϕ significantly. As a result, the objective for training the VLM, following the architectures specified in (3) and (5), can be formulated as

max_{ψ,ϕ} L_VLM(ψ, ϕ) = max_{ψ,ϕ} [ L_ELBO(ψ, ϕ) − C · L_RSA(h̃_t^ϕ, h_t^ψ; ψ, ϕ) ],  (8)

with C > 0, C ∈ R, being the constant balancing the scale of the ELBO and RSA terms.
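The inner pairwise sum of (7) for a single time step can be implemented in a few lines; the sketch below (illustrative, not the paper's code) vectorizes the double sum over pairs (j, k). It also makes the key property concrete: unlike MSE, the loss vanishes whenever the two recurrent states differ by a constant shift, since only within-state element differences are compared.

```python
import numpy as np

def rsa_loss(h_dec, h_enc):
    """Mean pairwise squared error of Eq. (7) for one time step.
    Compares element-wise *differences* within each recurrent state, so the
    two states are pushed toward structural similarity rather than equality.
    h_dec, h_enc: arrays of shape (M,)."""
    M = h_dec.shape[0]
    # All pairwise differences within each state (M x M matrices).
    d_dec = h_dec[:, None] - h_dec[None, :]
    d_enc = h_enc[:, None] - h_enc[None, :]
    sq = (d_dec - d_enc) ** 2
    iu = np.triu_indices(M, k=1)          # pairs (j, k) with j < k
    return (2.0 / (M * (M - 1))) * np.sum(sq[iu])
```

For example, `rsa_loss(h + 3.0, h)` is exactly zero for any `h`, whereas the MSE between the same two vectors would be 9; averaging this quantity over time steps and the N training trajectories recovers (7).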

2.4. BRANCHING FOR GENERATIVE DECODER

The performance of model-based methods can vary with different design factors (Fu et al., 2020b; Hanin & Rolnick, 2018). Specifically, Rossi et al. (2019) found that the convergence speed and optimality of variational models are sensitive to the choice of weight initialization techniques. Moreover, under the typical variational inference setup followed by the VLM above, the latent transitions reconstructed by the decoder, p_ϕ(z_t | z_{t-1}, a_{t-1}), are only trained through the regularization losses in (6) and (7), but are fully responsible for rolling out trajectories during evaluation. Consequently, in this sub-section we introduce the branching architecture for the decoder, with the goal of minimizing the impact of random weight initialization of the networks, and allowing the decoder to best reconstruct the latent transitions p_ϕ(z_t | z_{t-1}, a_{t-1}) as well as the s_t's and r_{t-1}'s correctly. Specifically, the branching architecture leverages an ensemble of B ∈ Z^+ decoders to tease out information from the latent space formulated by the encoder, with final predictions sampled from a mixture of the Gaussian output distributions from (5). Note that the classic setup of ensembles, i.e., training and averaging over B VLMs end-to-end, is not considered; in that case B different latent spaces would exist, each still associated with a single decoder, leaving the challenges above unresolved. This design choice is justified by ablation studies in Sec. 3, comparing VLBM against a (classic) ensemble of VLMs.

Branching Architecture. Consider the generative process involving B branches of decoders parameterized by {ϕ_1, ..., ϕ_B}. The forward architecture over a single step is illustrated in Fig. 2. Specifically, the procedure of sampling z_t^{ϕ_b} and s_t^{ϕ_b} for each b ∈ [1, B] follows from (5). Recall that by definition p_{ϕ_b}(s_t | z_t^{ϕ_b}) follows a multivariate Gaussian with mean and diagonal of the covariance matrix determined by the corresponding MLPs, i.e., μ(s_t^{ϕ_b}) = ϕ_{b,μ}^{MLP}(z_t^{ϕ_b}) and Σ_diag(s_t^{ϕ_b}) = ϕ_{b,Σ}^{MLP}(z_t^{ϕ_b}). In what follows, the final outcome s_t^ϕ can be sampled from a diagonal Gaussian with mean and variance determined by weighted averaging across all branches using weights w_b's, i.e.,

s_t^ϕ ∼ p_ϕ(s_t | z_t^{ϕ_1}, ..., z_t^{ϕ_B}) = N( μ = Σ_b w_b · μ(s_t^{ϕ_b}), Σ_diag = Σ_b w_b^2 · Σ_diag(s_t^{ϕ_b}) ).  (9)

The objective below can be used to jointly update the w_b's, ψ and the ϕ_b's, i.e.,

max_{ψ,ϕ,w} L_VLBM(ψ, ϕ_1, ..., ϕ_B, w_1, ..., w_B) = max_{ψ,ϕ,w} [ Σ_{t=0}^{T} log p_ϕ(s_t^ϕ | z_t^{ϕ_1}, ..., z_t^{ϕ_B}) − C_1 · Σ_b L_RSA(h̃_t^{ϕ_b}, h_t^ψ; ψ, ϕ_b) + C_2 · Σ_b L_ELBO(ψ, ϕ_b) ],  s.t. w_1, ..., w_B > 0, Σ_b w_b = 1,  (10)

with constants C_1, C_2 > 0. The constraints over the w_b's can be satisfied by setting w_b = (v_b^2 + ε) / Σ_{b'} (v_{b'}^2 + ε), with v_b ∈ R the learnable variables and 0 < ε ≪ 1, ε ∈ R, the constant ensuring the denominator to be greater than zero; this converts (10) into an unconstrained optimization that can be solved using gradient descent. Lastly, note that complementary latent modeling methods, e.g., latent overshooting from Hafner et al. (2019), could be adopted in (10); however, we keep the objective straightforward, so that the source of performance improvements can be isolated.

To evaluate the VLBM, we follow the guidelines from the deep OPE (DOPE) benchmark (Fu et al., 2020b). Specifically, we follow the D4RL branch in DOPE and use the Gym-Mujoco and Adroit suites as the test base (Fu et al., 2020a). Such environments have long horizons and high-dimensional state and action spaces, which are usually challenging for model-based methods. The provided offline trajectories for training are collected using behavioral policies of varied quality, including limited exploration, human teleoperation, etc., which can result in different levels of coverage over the state-action space. Also, the target (evaluation) policies are generated using online RL training, aiming to reduce the similarity between behavioral and target policies; this introduces another challenge that, during evaluation, the agent may visit states unseen in the training trajectories.
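A minimal numpy sketch of the branching outputs may help. Here `branch_weights` is our reading of the unconstrained parameterization of the w_b's via the learnable v_b's and the constant ε described in the text (the exact expression is garbled in the source, so treat this form as an assumption), and `mix_branches` combines the per-branch Gaussian parameters as in (9).

```python
import numpy as np

def branch_weights(v, eps=1e-6):
    """Map unconstrained v_b's to mixture weights: squaring keeps every
    weight positive and eps keeps the denominator nonzero, so w_b > 0 and
    sum_b w_b = 1 hold by construction (assumed parameterization)."""
    v = np.asarray(v, dtype=float)
    return (v ** 2 + eps) / np.sum(v ** 2 + eps)

def mix_branches(mus, vars_, w):
    """Combine the B per-branch diagonal Gaussians as in Eq. (9):
    the mean is the w-weighted average of branch means; the variance is
    sum_b w_b^2 * var_b. mus, vars_: shape (B, S); w: shape (B,)."""
    mu = np.sum(w[:, None] * mus, axis=0)
    var = np.sum((w[:, None] ** 2) * vars_, axis=0)
    return mu, var
```

Because the weights are free to collapse toward zero for unhelpful branches, this construction is consistent with the pruning behavior reported in Sec. 3, where most trained VLBMs keep only a few active branches.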

3. EXPERIMENTS

Environmental and Training Setup. A total of 8 environments are provided by the Gym-Mujoco and Adroit suites (Fu et al., 2020b; a). Moreover, each environment is provided with 5 (for Gym-Mujoco) or 3 (for Adroit) training datasets collected using different behavioral policies, resulting in a total of 32 env-dataset tasks; a full list can be found in Appendix A. DOPE also provides 11 target policies for each environment, whose performance is to be evaluated by the OPE methods. They in general result in varied scales of returns, as shown in the x-axes of Fig. 7. Moreover, we consider the decoder to have B = 10 branches, i.e., {p_ϕ1, ..., p_ϕ10}. The dimension of the latent space is set to 16, i.e., z ∈ Z ⊂ R^16. Other implementation details can be found in Appendix A.

Baselines and Evaluation Metrics. In addition to the five baselines reported from DOPE, i.e., importance sampling (IS) (Precup, 2000), doubly robust (DR) (Thomas & Brunskill, 2016), variational power method (VPM) (Wen et al., 2020), distribution correction estimation (DICE) (Yang et al., 2020), and fitted Q-evaluation (FQE) (Le et al., 2019), the effectiveness of VLBM is also compared against the state-of-the-art model-based OPE method leveraging the auto-regressive (AR) architecture (Zhang et al., 2020a). Specifically, for each task we train an ensemble of 10 AR models, for fair comparison against VLBM which leverages the branching architecture; see Appendix A for details of the AR ensemble setup. Following the DOPE benchmark (Fu et al., 2020b), our evaluation metrics include rank correlation, regret@1, and mean absolute error (MAE). VLBM and all baselines are trained using 3 different random seeds over each task, leading to the results reported below.

Ablation. Four ablation baselines are also considered, i.e., VLM, VLM+RSA, VLM+RSA(MSE) and VLM+RSA Ensemble. Specifically, VLM refers to the model introduced in Sec. 2.2, trained toward maximizing only the ELBO, i.e., (6).
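For reference, simplified versions of the three evaluation metrics can be sketched as follows. These are unnormalized illustrations that assume no ties among the returns; the DOPE benchmark may apply additional normalization.

```python
import numpy as np

def spearman_rank_corr(est, true):
    """Spearman correlation between estimated and true policy returns
    (Pearson correlation of the ranks; assumes no ties)."""
    def ranks(x):
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))
        return r
    a, b = ranks(np.asarray(est)), ranks(np.asarray(true))
    a, b = a - a.mean(), b - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

def regret_at_1(est, true):
    """Gap between the best true return and the true return of the policy
    ranked first by the OPE estimates."""
    est, true = np.asarray(est), np.asarray(true)
    return float(np.max(true) - true[np.argmax(est)])

def mae(est, true):
    """Mean absolute error between estimated and true returns."""
    return float(np.mean(np.abs(np.asarray(est) - np.asarray(true))))
```

A perfect ranking of the 11 target policies gives rank correlation 1 and regret@1 of 0, even if the absolute return estimates (and hence MAE) are off; this is why the three metrics are reported jointly.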
Note that, arguably, VLM could be seen as a generalization of directly applying the latent models proposed in existing RL policy optimization literature (Lee et al., 2020; Hafner et al., 2019; 2020a; b; Lu et al., 2022); details can be found in Sec. 4 below. The VLM+RSA ablation baseline follows the same model architecture as VLM, but is trained to optimize over both the ELBO and recurrent state alignment (RSA) as introduced in (8), i.e., branching is not used, in contrast to VLBM. The design of these two baselines can help analyze the effectiveness of the RSA loss term. Moreover, VLM+RSA(MSE) replaces the pairwise squared loss in RSA with the classic MSE, and VLM+RSA Ensemble applies a classic ensemble over multiple VLM+RSA models, averaging their output distributions.

Results. Fig. 3 shows the mean overall performance attained by VLBM and baselines over all the 32 Gym-Mujoco and Adroit tasks. In general, VLBM leads to significantly increased rank correlations and decreased regret@1's over existing methods, with MAEs maintained at the state-of-the-art level. Specifically, VLBM achieves state-of-the-art performance in 31, 29, and 15 (out of 32) tasks in terms of rank correlation, regret@1 and MAE, respectively. Performance for each task can be found in Tables 1-6 at the end of the Appendices. Note that results for IS, VPM, DICE, DR, and FQE are obtained directly from the DOPE benchmark (Fu et al., 2020b), since the same experimental setup is considered. Fig. 4 and 5 visualize the mean performance for each Gym-Mujoco and Adroit environment respectively, over all the associated datasets. It can also be observed that the model-based and FQE baselines generally perform better than the other baselines, which is consistent with findings from DOPE. The fact that VLM+RSA outperforms the VLM ablation baseline, as shown in Fig. 4, illustrates the need for the RSA loss term to smooth the flow of information between the encoder and decoder in the latent space. Moreover, one can observe that VLM+RSA(MSE) sometimes performs worse than VLM, and significantly worse than VLM+RSA in general.
Specifically, it has been found that, compared to VLM and VLM+RSA respectively, VLM+RSA(MSE) significantly worsens at least two metrics in 7 and 12 (out of 20) Gym-Mujoco tasks; detailed performance over these tasks can be found in Tables 1-6 at the end of the Appendices. Such a finding backs up the design choice of using the pairwise loss for RSA instead of MSE, as MSE could be overly strong in regularizing the LSTM recurrent states of the encoder and decoder, while the pairwise loss only enforces structural similarities. Moreover, VLBM greatly improves rank correlations and regrets compared to VLM+RSA, illustrating the importance of the branching architecture. In the paragraph below, we show empirically the benefits brought in by branching over classic ensembles.

Branching versus Classic Ensembles. Fig. 4 shows that the VLM+RSA Ensemble does not improve performance over VLM+RSA in general, and even leads to worse overall rank correlations and regrets in the Walker2d and Hopper environments. This supports the rationale provided in Sec. 2.4 that each decoder still samples from a different latent space exclusively, and averaging over the output distributions may not help reduce the disturbance brought in by the modeling artifacts under the variational inference framework, e.g., random weight initializations (Hanin & Rolnick, 2018; Rossi et al., 2019). In contrast, the VLBM leverages the branching architecture, allowing all the branches to sample from the same latent space formulated by the encoder. Empirically, we find that the branching weights, the w_b's in (9), allow VLBM to prune branches that are not helpful toward reconstructing the trajectories accurately, possibly overcoming bad initializations etc. Over all the 32 tasks we consider, most VLBMs only keep 1-3 branches (out of 10), i.e., w_b < 10^-5 for all other branches. The distribution of all w_b's, from VLBMs trained on the 32 tasks, is shown in Fig.
6; one can observe that most of the w_b's are close to zero, while the others generally fall in the ranges (0, 0.25] and [0.75, 1). AR ensembles also lead to compelling rank correlations and regrets, but attain much smaller margins in MAEs over other baselines in general; see Fig. 3. From Fig. 7, one can observe that AR tends to significantly under-estimate most of the high-performing policies. Scatter plots for the other tasks can be found in Appendix A, which also show this trend. The reason could be that its model architecture and training objectives are designed to directly learn the transitions of the MDP; thus, it may produce biased predictions when the target policies lead to visitation of states that are not substantially represented in the training data, since such data are obtained using behavioral policies that are sub-optimal. In contrast, the VLBM can leverage RSA and branching against such situations, thus outperforming AR ensembles in most of the OPE tasks in terms of all metrics we considered. Interestingly, Fig. 7 also shows that latent models could sometimes over-estimate the returns. For example, in Hopper-M-E and Walker2d-M-E, VLM tends to over-estimate most policies. The VLBM performs consistently well in Hopper-M-E, but is mildly affected by such an effect in Walker2d-M-E, though over fewer policies and with smaller margins. It has been found that variational inference may fall short in approximating true distributions that are asymmetric, and produce biased estimations (Yao et al., 2018). Thus, our hypothesis is that the dynamics used to define certain environments may lead to asymmetry in the true posterior p(z_t | z_{t-1}, a_{t-1}, s_t), which could be hard to capture by the latent modeling framework we consider. A more comprehensive understanding of such behavior can be explored in future work.
However, the VLBM still significantly outperforms VLM overall, and achieves top-performing rank correlations and regrets; such results illustrate the VLBM's improved robustness, as a result of its architectural design and choices of training objectives.

t-SNE Visualization of the Latent Space. Fig. 8 illustrates the t-SNE visualization of the latent space, obtained by rolling out trajectories using all target policies respectively, and then feeding the state-action pairs into the encoder of VLBM, which maps them into the latent space. It shows that the encoded state-action pairs induced from policies with similar performance are in general swirled and clustered together, illustrating that VLBM can learn expressive and disentangled representations of its inputs.

4. RELATED WORK

Latent Modeling in RL. Though variational inference has rarely been explored to facilitate model-based OPE methods so far, there exist several latent models designed for RL policy optimization that are related to our work, such as SLAC (Lee et al., 2020), SOLAR (Zhang et al., 2019), LatCo (Rybkin et al., 2021), PlaNet (Hafner et al., 2019), and Dreamer (Hafner et al., 2020a; b). Below we discuss the connections and distinctions between VLBM and the latent models leveraged by these methods, with a detailed overview provided in Appendix G. Specifically, SLAC and SOLAR learn latent representations of the dynamics jointly with optimization of the target policies, using the latent information to improve sample efficiency. Similarly, LatCo performs trajectory optimization over the latent space to allow for temporarily bypassing dynamic constraints. As a result, latent models used in such methods are not designed toward rolling out trajectories independently, as opposed to the use of VLBM in this paper. PlaNet and Dreamer train the recurrent state space model (RSSM) using a growing experience dataset collected by the target policy that is being concurrently updated (with exploration noise added), which requires online data collection. In contrast, under the OPE setup, VLBM is trained over a fixed set of offline trajectories collected by unknown behavioral policies. Moreover, note that the VLM baseline is somewhat reminiscent of the RSSM and similar models as in Lee et al. (2020); Lu et al. (2022); however, experiments above show that directly using VLM for OPE could lead to subpar performance.
On the other hand, though MOPO (Yu et al., 2020), LOMPO (Rafailov et al., 2021) and COMBO (Yu et al., 2021) can learn from offline data, they focus on quantifying the uncertainty of the model's predictions of next states and rewards, followed by incorporating them into policy optimization objectives to penalize visiting regions where transitions are not fully captured; thus, such works are also orthogonal to the use case of OPE.

OPE. Classic OPE methods adopt IS to estimate expectations over the unknown visitation distribution of the target policy, resulting in weighted IS, step-wise IS and weighted step-wise IS (Precup, 2000). IS can lead to estimations with low (or zero) bias, but with high variance (Kostrikov & Nachum, 2020; Jiang & Li, 2016), which sparks a long line of research to address this challenge. DR methods propose to reduce variance by coupling IS with a value function approximator (Jiang & Li, 2016; Thomas & Brunskill, 2016; Farajtabar et al., 2018). However, the introduction of such approximations may increase bias, so the method proposed in Tang et al. (2019) attempts to balance the scale of bias and variance for DR. Unlike IS and DR methods that require the behavioral policies to be fully known, the DICE family of estimators (Zhang et al., 2020c; b; Yang et al., 2021; 2020; Nachum et al., 2019; Dai et al., 2020) and VPM (Wen et al., 2020) can be behavioral-agnostic; they directly capture marginalized IS weights as the ratio between the propensity of the target policy to visit particular state-action pairs, relative to their likelihood of appearing in the logged data. There also exist FQE methods which extrapolate policy returns from approximated Q-functions (Hao et al., 2021; Le et al., 2019; Kostrikov & Nachum, 2020).
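As an illustration of the classic estimators discussed above, a minimal step-wise (per-decision) IS sketch is given below. It assumes the behavioral policy's action probabilities are known, and is not code from any of the cited works.

```python
import numpy as np

def per_decision_is(trajs, pi_prob, beta_prob, gamma=0.99):
    """Step-wise (per-decision) importance sampling estimate of the target
    policy's return from trajectories logged under a known behavioral policy.
    trajs: list of trajectories, each a list of (s, a, r) tuples;
    pi_prob/beta_prob: callables returning pi(a|s) and beta(a|s)."""
    estimates = []
    for traj in trajs:
        rho, total = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_prob(a, s) / beta_prob(a, s)  # cumulative IS weight
            total += (gamma ** t) * rho * r         # weight only rewards so far
        estimates.append(total)
    return float(np.mean(estimates))
```

When the target and behavioral policies coincide, every weight is 1 and the estimator reduces to a plain Monte Carlo average; the variance issue noted above arises because the cumulative weight rho is a product that can explode with the horizon.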
Existing model-based OPE methods are designed to directly fit MDP transitions using feed-forward (Fu et al., 2020b) or auto-regressive (Zhang et al., 2020a) models, and have shown promising results over model-free methods, as reported in a recent benchmark (Fu et al., 2020b). However, such model-based approaches could be sensitive to the initialization of weights (Hanin & Rolnick, 2018; Rossi et al., 2019) and produce biased predictions, due to the limited coverage over state and action space provided by offline trajectories (Fu et al., 2020b). Instead, VLBM mitigates such effects by capturing the dynamics over the latent space, such that states and rewards are evolved from a compact feature space over time. Moreover, RSA and branching can lead to increased expressiveness and robustness, such that future states and rewards are predicted accurately. There also exist OPE methods proposed for specific applications (Chen et al., 2022; Saito et al., 2021; Gao et al., 2023; 2022b).

5. CONCLUSION AND FUTURE WORK

We have developed the VLBM, which can accurately capture the dynamics underlying environments from offline training data that provide limited coverage of the state and action space. This is achieved by using the RSA term to smooth out the information flow from the encoders to the decoders in the latent space, as well as the branching architecture, which improves VLBM's robustness against random initializations. We have followed the evaluation guidelines provided by the DOPE benchmark, and experimental results have shown that the VLBM generally outperforms the state-of-the-art model-based OPE method using AR architectures, as well as other model-free methods. VLBM can also facilitate off-policy optimization, which can be explored in future work. Specifically, VLBM can serve as a synthetic environment on which optimal controllers (e.g., linear-quadratic regulators) can be deployed. On the other hand, similar to Dreamer and SLAC, policies can be updated jointly with the training of VLBM, but without the need for online interactions with the environment during training.

A ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS

Additional Results and Discussions. Rank correlations, regret@1 and MAEs for all 32 tasks are documented in Tables 1-6 below. The mean and standard deviation (in subscripts) over 3 random seeds are reported. Note that in each column, the performance of multiple methods may be highlighted in bold, meaning they all achieve the best performance and do not significantly outperform one another. The fact that VLBM outperforms the ablation baselines in most cases suggests that the RSA loss term and branching architecture can effectively increase model expressiveness, and allow the dynamics underlying the MDP to be learned more accurately and robustly from offline data that provide limited exploration coverage. Yet, smaller margins are attained between the VLBM and VLM+RSA in Hopper-M-E and Hopper-M. This is likely because Hopper has a relatively lower-dimensional state space compared to the other three environments, so its underlying dynamics can be sufficiently captured by the VLM+RSA. Figs. 10 and 11 show the correlation between estimated (y-axis) and true (x-axis) returns for all the OPE tasks we consider. It can be found that for Halfcheetah-R, -M-R and -M, most of the model-based methods cannot significantly distinguish the returns across target policies. The cause could be that the offline trajectories provided for these tasks are relatively more challenging, compared to the other OPE tasks. Such an effect appears to impact IS, VPM, DICE, DR and FQE to a larger extent: it can be observed from the scatter plots reported in the DOPE benchmark (Fu et al., 2020b) that these methods could hardly tell the scale of returns across different target policies, as the dots almost form a horizontal line in each plot. However, the estimated returns from VLBM and IS still preserve the rank, which leads to high rank correlations and low regrets.

Implementation Details and Hyper-parameters.
The model-based methods are evaluated by directly interacting with each target policy for 50 episodes, and the mean of discounted total returns (γ = 0.995) over all episodes is used as the estimated performance for the policy. We choose the neural network architectures as follows. For the components involving LSTMs, i.e., q_ψ(z_t | z_{t-1}, a_{t-1}, s_t) and p_ϕ(z_t | z_{t-1}, a_{t-1}), the architecture includes one LSTM layer with 64 nodes, followed by a dense layer with 64 nodes. All other components do not involve LSTM layers, so each is constituted by a neural network with 2 dense layers, with 128 and 64 nodes respectively. The output layers that determine the mean and diagonal covariance of diagonal Gaussian distributions use linear and softplus activations, respectively. The ones that determine the mean of Bernoulli distributions (e.g., for capturing early termination of episodes) use sigmoid activations. VLBM and the two ablation baselines, VLM and VLM+RSA, are trained using offline trajectories provided by DOPE, with max_iter in Alg. 1 set to 1,000 and the minibatch size set to 64. The Adam optimizer is used to perform gradient descent. To determine the learning rate, we perform grid search over {0.003, 0.001, 0.0007, 0.0005, 0.0003, 0.0001, 0.00005}. Exponential decay is applied, which decays the learning rate by 0.997 every iteration. To train VLBM, we set the constants from equation 10 such that C_1 = C_2, and perform grid search over {5, 1, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0001}. To train VLM+RSA, the constant C from equation 8 is determined by grid search over the same set of values above. L2-regularization with decay of 0.001 and batch normalization are applied to all hidden layers.
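As an illustration, an LSTM-based component as described above (one LSTM layer with 64 nodes followed by a 64-node dense layer, with a linear mean head and a softplus covariance head) could be sketched as follows; this is a hypothetical sketch, not the authors' implementation, and the input/latent dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagGaussianHead(nn.Module):
    """Outputs the mean (linear activation) and diagonal covariance
    (softplus activation) of a diagonal Gaussian, as described above."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, out_dim)
        self.cov = nn.Linear(in_dim, out_dim)

    def forward(self, h):
        return self.mu(h), F.softplus(self.cov(h))

class LatentEncoder(nn.Module):
    """Sketch of a component like q_psi(z_t | z_{t-1}, a_{t-1}, s_t):
    one LSTM layer with 64 nodes followed by a dense layer with 64 nodes."""
    def __init__(self, in_dim, z_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.dense = nn.Linear(hidden, hidden)
        self.head = DiagGaussianHead(hidden, z_dim)

    def forward(self, x):  # x: (batch, time, in_dim), a stand-in for [z_{t-1}; a_{t-1}; s_t]
        h, _ = self.lstm(x)
        h = F.relu(self.dense(h))
        return self.head(h)

enc = LatentEncoder(12, 16)               # 12 and 16 are illustrative dimensions
mu, cov = enc(torch.randn(2, 5, 12))      # batch of 2 sequences of length 5
```

The softplus head guarantees a strictly positive diagonal covariance, matching the activation choices stated above.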
Note that some of the environments (e.g., Ant, Hopper, Walker2d, Pen) may terminate an episode before timeout, if the state meets specific conditions; details on how VLBM captures such early-termination behavior are introduced in Appendix D.

The DOPE Benchmark. The deep OPE (DOPE) benchmark (Fu et al., 2020b) provides a standardized training and evaluation procedure for OPE works to follow, which facilitates fair and comprehensive comparisons among various OPE methods. Specifically, it utilizes existing environments and training trajectories provided by D4RL and RLUnplugged, two benchmark suites for offline RL training, and additionally provides target policies for OPE methods to evaluate. In the D4RL branch, the training trajectories are originally collected from various sources, including random exploration, human teleoperation, and RL-trained policies with limited exploration; they thus provide varied levels of coverage over the state-action space. Moreover, the target policies are trained using online RL algorithms, which in general leads to different state-action visitations than in the training trajectories. We leverage the D4RL branch as our test base, since the OPE tasks it provides are considered challenging, due to the limited coverage introduced by the training data, as well as the discrepancy between the behavioral and target policies. Graphical illustrations of the Gym-Mujoco and Adroit environments considered are shown in Fig. 9. Details on the environments and datasets used are shown in Tables 7 and 8, covering state and action dimensions, whether episodes can terminate before timeout, whether controls are performed over a continuous space, and the size of the offline trajectories used for training. In contrast, in the RLUnplugged branch, the training trajectories are always collected through online RL training, which can result in adequate coverage over the state-action space.
The target policies are trained by applying offline RL over the training trajectories, so that behavioral and target policies lead to similar state-action visitation distributions. As discussed in DOPE (Fu et al., 2020b), such tasks are suitable for studies where ideal data are needed, such as complexity comparisons.

Evaluation Metrics. Following (Fu et al., 2020b), we consider rank correlation, regret@1 and mean absolute error (MAE) as the evaluation metrics. Specifically, rank correlation measures the strength and direction of the monotonic association between the rank of OPE-estimated returns and true returns over all target policies; it is captured by Spearman's correlation coefficient between the ordinal rankings of estimated and true returns. Regret@1 is captured by the difference between the true return of the policy with the highest OPE-estimated return and the return of the policy that actually produces the highest true return. In other words, regret@1 evaluates how much worse the policy with the highest OPE-estimated return would perform than the actual best policy. These two metrics evaluate how useful OPE would be for important applications such as policy selection. Finally, we also consider MAE, which is commonly used in estimation/regression tasks. Mathematical definitions of these metrics can be found in (Fu et al., 2020b).

Implementation of AR Ensembles. For fair comparison with VLBM, in experiments we train an ensemble of the state-of-the-art model-based OPE method, auto-regressive (AR) models (Zhang et al., 2020a), as one of the baselines. Specifically, we train an ensemble of 10 AR models to learn p(s_{t+1}, r_t | s_t, a_t) in an auto-regressive manner, with each individual model following the design introduced in (Zhang et al., 2020a), i.e., s^{(j)}_{t+1} ∼ p(s^{(j)}_{t+1} | s_t, a_t, s^{(1)}_{t+1}, ..., s^{(j-1)}_{t+1}), with s^{(j)}_{t+1} representing the element located at the j-th dimension of the state variable, and D the dimension of the state space. The reward is treated as an additional dimension of the states, i.e., r_t ∼ p(r_t | s_t, a_t, s^{(1)}_{t+1}, ..., s^{(D)}_{t+1}). However, the original literature (Zhang et al., 2020a) does not introduce details regarding which specific ensemble architecture is used (e.g., overall averaging or weighted averaging). As a result, we choose the same weighted averaging procedure as used in VLBM branching, to sort out the influence of different ensemble architectures and facilitate fair comparisons. Specifically, a total of 10 AR models, parameterized by {θ_1, ..., θ_10}, along with 10 weight variables {w_{θ_1}, ..., w_{θ_10} | Σ_i w_{θ_i} = 1}, are trained. Similar to the weighted averaging architecture used in VLBM, i.e., equation 9, the mean and variance of the prediction s^{(j)}_{t+1}, captured by the normal distribution N(μ, σ²), follow μ = Σ_{i=1}^{10} w_{θ_i} · μ_{θ_i}(s^{(j)}_{t+1}) and σ² = Σ_{i=1}^{10} (w_{θ_i})² · σ²_{θ_i}(s^{(j)}_{t+1}), where μ_{θ_i}(s^{(j)}_{t+1}) and σ²_{θ_i}(s^{(j)}_{t+1}) are the mean and variance produced from each individual AR model in the ensemble.

License. The use of DOPE and D4RL (Fu et al., 2020a) follows the Apache License 2.0.
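As a concrete illustration of the rank-correlation and regret@1 metrics described above, the following is a minimal sketch assuming no ties among returns; DOPE's official implementation may differ in details:

```python
import numpy as np

def rank_correlation(estimated, true_returns):
    """Spearman's rank correlation between OPE-estimated and true returns,
    via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    est_rank = np.argsort(np.argsort(estimated))
    true_rank = np.argsort(np.argsort(true_returns))
    d = est_rank - true_rank
    n = len(estimated)
    return 1.0 - 6.0 * float(np.sum(d ** 2)) / (n * (n ** 2 - 1))

def regret_at_1(estimated, true_returns):
    """Gap between the best true return and the true return of the
    policy ranked first by the OPE estimates."""
    top_by_estimate = int(np.argmax(estimated))
    return float(np.max(true_returns) - true_returns[top_by_estimate])

estimated = np.array([10.0, 30.0, 20.0])     # OPE-estimated returns of 3 policies
true_returns = np.array([12.0, 25.0, 28.0])  # their true returns
print(rank_correlation(estimated, true_returns))  # -> 0.5
print(regret_at_1(estimated, true_returns))       # -> 3.0
```

Here the OPE estimates rank policy 2 first, but the true best is policy 3, so regret@1 is 28 − 25 = 3.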

C ALGORITHMS FOR TRAINING AND EVALUATING VLBM

Algorithm 1 Train VLBM.
Input: Model weights ψ, ϕ_1, ..., ϕ_B, w_1, ..., w_B, offline trajectories ρ_β, and learning rate α.
Begin:
1: Initialize ψ, ϕ_1, ..., ϕ_B, w_1, ..., w_B
2: for iter in 1 : max_iter do
3:     Sample a trajectory [(s_0, a_0, r_0, s_1), ..., (s_{T-1}, a_{T-1}, r_{T-1}, s_T)] ∼ ρ_β
4:     z^ψ_0 ∼ q_ψ(z_0 | s_0)
5:     z^{ϕ_b}_0 ∼ p(z_0), for all b ∈ [1, B]
6:     Run the forward pass of VLBM following (3), (5) and (9) for t = 1 : T, and collect all variables needed to evaluate L_VLBM as specified in (10)
7:     ψ ← ψ + α∇_ψ L_VLBM
8:     for b in 1 : B do
9:         ϕ_b ← ϕ_b + α∇_{ϕ_b} L_VLBM
10:        w_b ← w_b + α∇_{w_b} L_VLBM
11:    end for
12: end for

Algorithm 2 Evaluate VLBM.
Input: Trained model weights ψ, ϕ_1, ..., ϕ_B, w_1, ..., w_B
Begin:
1: Initialize the list that stores the accumulated returns over all episodes, R = []
2: for epi in 1 : max_epi do
3:     Initialize the variable r = 0 that tracks the accumulated return for the current episode
4:     Sample z^{ϕ_b}_0 from the prior, i.e., z^{ϕ_b}_0 ∼ p(z_0), for all b ∈ [1, B]
5:     Initialize LSTM hidden states h^{ϕ_b}_0 = 0 for all b ∈ [1, B]
6:     Sample s^{ϕ_b}_0 ∼ p_ϕ(s_0 | z^{ϕ_b}_0) for all b ∈ [1, B], and generate the initial MDP state s^ϕ_0 following (9)
7:     for t in 1 : T do
8:         Determine the action following the target policy π, i.e., a_{t-1} ∼ π(a_{t-1} | s^ϕ_{t-1})
9:         for b in 1 : B do
10:            Update h^{ϕ_b}_t, h̃^{ϕ_b}_t, z^{ϕ_b}_t, s^{ϕ_b}_t, r^{ϕ_b}_{t-1} following (5)
11:        end for
12:        Generate the next state s^ϕ_t following (9), as well as the reward r^ϕ_{t-1} ∼ p_ϕ(r_{t-1} | z^{ϕ_1}_t, ..., z^{ϕ_B}_t) = N(μ = Σ_b w_b · μ(r^{ϕ_b}_{t-1}), Σ_diag = Σ_b w_b² · Σ_diag(r^{ϕ_b}_{t-1}))
13:        Update r ← r + γ^{t-1} · r^ϕ_{t-1}, with γ being the discounting factor
14:    end for
15:    Append r into R
16: end for
17: Average over all elements in R, which serves as the estimated return of π
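To make the weighted averaging in step 12 of Algorithm 2 concrete, the sketch below aggregates B per-branch Gaussian predictions into a single Gaussian and accumulates a discounted return; the fixed per-branch means and variances are toy stand-ins for the actual decoder branches:

```python
import numpy as np

def aggregate_branches(w, means, variances):
    """Weighted averaging across branches (cf. step 12 of Algorithm 2):
    mu = sum_b w_b * mu_b,  sigma^2 = sum_b w_b^2 * sigma_b^2."""
    return float(np.sum(w * means)), float(np.sum(w ** 2 * variances))

rng = np.random.default_rng(0)
B, T, gamma = 3, 5, 0.995
w = np.array([0.5, 0.3, 0.2])  # branch weights, summing to 1

total_return, discount = 0.0, 1.0
for t in range(1, T + 1):
    # Toy per-branch reward predictions standing in for the decoder branches.
    means = np.array([1.0, 1.1, 1.2])
    variances = np.array([0.04, 0.04, 0.04])
    mu, var = aggregate_branches(w, means, variances)
    r = rng.normal(mu, np.sqrt(var))  # sample r_{t-1} from the aggregated Gaussian
    total_return += discount * r      # r <- r + gamma^{t-1} * r_{t-1}
    discount *= gamma

print(round(aggregate_branches(w, means, variances)[0], 4))  # -> 1.07
```

Squaring the weights in the variance term follows the rule for a weighted sum of independent Gaussians, which is why the aggregated variance shrinks relative to a single branch.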

D EARLY TERMINATION OF ENVIRONMENTS

Some Gym-Mujoco environments, including Ant, Hopper, Walker2d and Pen, may terminate an episode before reaching the maximum number of steps, if the state violates specific constraints. Below we introduce how VLM and VLBM can be enriched to capture such early-termination behaviors.

VLM. For VLM, we introduce an additional component d^ϕ_t ∼ p_ϕ(d_t | z^ϕ_t) to the generative process in equation 5, where d^ϕ_t is a Bernoulli variable determining if an episode should be terminated at its t-th step. Specifically, p_ϕ(d_t | z^ϕ_t) follows a Bernoulli distribution, with mean determined by an MLP with sigmoid activation applied to the output layer. As a result, the generative process now follows h^ϕ_t = f_ϕ(h^ϕ_{t-1}, z^ϕ_{t-1}, a_{t-1}), h̃^ϕ_t = g_ϕ(h^ϕ_t), z^ϕ_t ∼ p_ϕ(h̃^ϕ_t), s^ϕ_t ∼ p_ϕ(s_t | z^ϕ_t), r^ϕ_{t-1} ∼ p_ϕ(r_{t-1} | z^ϕ_t), d^ϕ_t ∼ p_ϕ(d_t | z^ϕ_t), a_t ∼ π(a_t | s^ϕ_t). Moreover, we add a new term to VLM's training objective, in order to update the component introduced above during training, i.e., L^{early_term}_{VLM}(ψ, ϕ) = L_{VLM}(ψ, ϕ) + Σ_{t=0}^T log p_ϕ(d_t | z_t), with L_{VLM}(ψ, ϕ) being the original objective of VLM, as presented in equation 8.

VLBM. For VLBM, the termination of an episode is determined following d^ϕ_t ∼ p_ϕ(d_t | z^{ϕ_1}_t, ..., z^{ϕ_B}_t) = Bernoulli(μ = Σ_b w_b · μ_d(d^{ϕ_b}_t)), where μ_d(d^{ϕ_b}_t) = ϕ^{MLP}_{b,μ_d}(z^{ϕ_b}_t) is the mean of d^{ϕ_b}_t produced from the b-th branch of the decoder, and ϕ^{MLP}_{b,μ_d} is the corresponding MLP that maps z^{ϕ_b}_t to μ_d(d^{ϕ_b}_t). To update the components involved in the procedure above, we introduce a new term to the VLBM objective, i.e.,

L^{early_term}_{VLBM}(ψ, ϕ_1, ..., ϕ_B, w_1, ..., w_B) = L_{VLBM}(ψ, ϕ_1, ..., ϕ_B, w_1, ..., w_B) + Σ_{t=0}^T log p_ϕ(d^ϕ_t | z^{ϕ_1}_t, ..., z^{ϕ_B}_t),   (16)

with L_{VLBM} being the original objective of VLBM, as presented in equation 10.
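A minimal sketch of how such a sigmoid-parameterized Bernoulli termination head could cut a rollout short; the `termination_prob` weights and latent trajectory below are hypothetical stand-ins, not the paper's trained MLP:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def termination_prob(z, weights):
    """Output layer with sigmoid activation -> mean of the Bernoulli d_t."""
    return sigmoid(float(z @ weights))

def rollout_with_early_termination(T, z_seq, weights):
    """Roll forward up to T steps; stop as soon as d_t ~ Bernoulli(mu_t) fires."""
    for t in range(T):
        mu = termination_prob(z_seq[t], weights)
        if rng.random() < mu:  # sample d_t; 1 means terminate here
            return t + 1       # episode length
    return T

z_seq = rng.normal(size=(10, 4))          # toy latent trajectory
weights = np.array([5.0, 5.0, 5.0, 5.0])  # toy output-layer weights
length = rollout_with_early_termination(10, z_seq, weights)
```

During evaluation, returns are then accumulated only over the `length` steps that were actually rolled out.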

E BOUND DERIVATION

We now derive the evidence lower bound (ELBO) for the joint log-likelihood, i.e.,

\begin{align}
\log p_\phi(s_{0:T}, r_{0:T-1}) &= \log \int_{z_{0:T} \in \mathcal{Z}} p_\phi(s_{0:T}, z_{0:T}, r_{0:T-1}) \, dz \nonumber \\
&= \log \int_{z_{0:T} \in \mathcal{Z}} p_\phi(s_{0:T}, z_{0:T}, r_{0:T-1}) \frac{q_\psi(z_{0:T} | s_{0:T}, a_{0:T-1})}{q_\psi(z_{0:T} | s_{0:T}, a_{0:T-1})} \, dz \tag{20} \\
&\geq \mathbb{E}_{q_\psi} \big[ \log p(z_0) + \log p_\phi(s_{0:T}, z_{1:T}, r_{0:T-1} | z_0) - \log q_\psi(z_{0:T} | s_{0:T}, a_{0:T-1}) \big] \tag{21} \\
&= \mathbb{E}_{q_\psi} \Big[ \log p(z_0) + \log p_\phi(s_0 | z_0) + \sum_{t=1}^{T} \log p_\phi(s_t, z_t, r_{t-1} | z_{t-1}, a_{t-1}) - \log q_\psi(z_0 | s_0) - \sum_{t=1}^{T} \log q_\psi(z_t | z_{t-1}, a_{t-1}, s_t) \Big] \tag{22} \\
&= \mathbb{E}_{q_\psi} \Big[ \log p(z_0) - \log q_\psi(z_0 | s_0) + \log p_\phi(s_0 | z_0) + \sum_{t=1}^{T} \log \big( p_\phi(s_t | z_t) \, p_\phi(r_{t-1} | z_t) \, p_\phi(z_t | z_{t-1}, a_{t-1}) \big) - \sum_{t=1}^{T} \log q_\psi(z_t | z_{t-1}, a_{t-1}, s_t) \Big] \tag{23} \\
&= \mathbb{E}_{q_\psi} \Big[ \sum_{t=0}^{T} \log p_\phi(s_t | z_t) + \sum_{t=1}^{T} \log p_\phi(r_{t-1} | z_t) \Big] - \mathrm{KL} \big( q_\psi(z_0 | s_0) \,\|\, p(z_0) \big) - \sum_{t=1}^{T} \mathrm{KL} \big( q_\psi(z_t | z_{t-1}, a_{t-1}, s_t) \,\|\, p_\phi(z_t | z_{t-1}, a_{t-1}) \big). \tag{24}
\end{align}

Note that the transition from equation 20 to equation 21 follows Jensen's inequality.
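The KL terms in the bound above are between diagonal Gaussians and thus have a closed form; the following sketch evaluates such terms numerically (an illustration of the formula, not the training code):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), closed form:
    0.5 * sum( log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1 )."""
    return 0.5 * float(np.sum(np.log(var_p / var_q)
                              + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))

# KL between identical distributions is 0.
mu = np.array([0.3, -1.2]); var = np.array([0.5, 2.0])
print(kl_diag_gaussians(mu, var, mu, var))  # -> 0.0

# Sum of per-step KL terms, as in the last line of the bound; the per-step
# posterior/prior parameters here are arbitrary placeholders.
T = 3
total_kl = sum(kl_diag_gaussians(np.zeros(2), np.ones(2),
                                 np.full(2, 0.1 * t), np.ones(2))
               for t in range(1, T + 1))
```

In practice each per-step KL would use the posterior parameters from q_ψ(z_t | z_{t-1}, a_{t-1}, s_t) and the prior parameters from p_ϕ(z_t | z_{t-1}, a_{t-1}).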



Footnotes.
- We slightly abuse the notation ρ_β to represent either the trajectories or the state-action visitation distribution under the behavioral policy, depending on the context.
- Assume that different dimensions of the states are non-correlated with each other. Otherwise, the states can be projected onto an orthogonal basis, such that the non-diagonal elements of the covariance matrix are zeros.
- Rewards and actions are omitted for conciseness of the presentation.
- For simplicity, the parts generating rewards are omitted without loss of generality.
- From now on, dataset names are abbreviated by their initials; e.g., Ant-M-R refers to Ant-Medium-Replay.
- Some VPM entries are absent since they were not reported in Fu et al. (2020b), nor is the code open-sourced.
- https://github.com/rail-berkeley/d4rl
- https://github.com/deepmind/deepmind-research/tree/master/rl_unplugged
- https://github.com/google-research/deep_ope



Figure 2: (Left) Recurrent state alignment (RSA) applied over the recurrent hidden states between inference and generative process illustrated separately. (Right) Single-step forward pass of the variational latent branching model (VLBM), the training objectives for each branch and final predictions.

Though the first term above already propagates through all w_b's and ϕ_b's, the third term and the constraints over the w_b's regularize ϕ_b in each individual branch, such that they are all trained toward maximizing the likelihood p_{ϕ_b}(s^{ϕ_b}_t | z^{ϕ_b}_t). Pseudo-code for training and evaluating the VLBM can be found in Appendix C. Further, in practice, one can define w_b =

Figure 3: Mean rank correlation, regret@1 and MAE over all the 32 Gym-Mujoco and Adroit tasks, showing VLBM achieves state-of-the-art performance overall.

Figure 4: Mean rank correlation, regret@1 and MAE over all datasets, for each Mujoco environment.

Figure 5: Mean rank correlation, regret@1 and MAE over all datasets, for each Adroit environment.

The ablation baselines VLM and VLM+RSA are constructed by removing the RSA loss term and branching architecture introduced in Sec. 2.3 and 2.4. Moreover, VLM+RSA(MSE) uses mean squared error to replace the pairwise loss introduced in (7), and the VLM+RSA Ensemble applies a classic ensemble by averaging over B VLM+RSA models end-to-end, instead of branching from the decoder as in VLBM. These two ablation baselines help justify the use of the pairwise loss for RSA, and the benefit of the branching architecture over classic ensembles.

Figure 6: Distribution of all branching weights, w b 's, over all VLBMs trained on the 32 tasks.

Figure 8: t-SNE visualization over the latent space, capturing encoded state-action visitations induced from all target policies. Each point is colored by the corresponding policy from which it is generated. Policies in the legend are sorted in the order of increasing performance.

Figure 7: Correlation between the estimated (yaxis) and true returns (x-axis), across different model-based OPE methods and environments.

Figure 9: The Gym-Mujoco and Adroit environments considered by the D4RL branch of DOPE.

Training Resources. Training of the proposed method and baselines was performed on Nvidia Quadro RTX 6000, NVIDIA RTX A5000, and NVIDIA TITAN XP GPUs.

Figure 10: Scatter plots between OPE-estimated (y-axis) and true (x-axis) returns over all 20 Gym-Mujoco tasks that are considered. Part 1.

Figure 13: t-SNE visualization over the latent space captured by VLM+RSA(MSE), illustrating encoded state-action visitations induced from all target policies. Each point is colored by the corresponding policy from which it is generated. Policies in the legend are sorted in the order of increasing performance.

Figures 12 and 13 above visualize the latent space captured by two ablation baselines, VLM and VLM+RSA(MSE), respectively. It can be observed that the latent space captured by VLM is not disentangled as well as that of VLBM (shown in Figure 8), as the state-action pairs induced by policies with different levels of performance generally cluster together without explicit boundaries. Such a finding empirically illustrates the importance of the RSA loss (7), as it can effectively regularize p_ψ(z_t | z_{t-1}, a_{t-1}, s_t) and allows the encoder to map the MDP states to an expressive and compact latent space from which the decoder can reconstruct states and rewards accurately. Moreover, Figure 13 shows that the latent representations of the state-action pairs captured by VLM+RSA(MSE) are distributed almost uniformly over the latent space. This justifies the rationale provided in Sec. 2.3 that MSE is too strong a regularizer for the hidden states of the encoder and decoder, and is also consistent with the results reported in Figure 3 that VLM+RSA(MSE) performs worse than VLM in general.
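The kind of t-SNE visualization used in these figures can be sketched as follows, with scikit-learn's TSNE as a stand-in and synthetic latent vectors and policy labels in place of the encoded state-action visitations:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-ins for encoded state-action latents from 3 target policies.
latents = np.concatenate([
    rng.normal(loc=2.0 * i, scale=0.5, size=(40, 8)) for i in range(3)
])
policy_ids = np.repeat(np.arange(3), 40)  # color label per point

# Project the 8-D latent vectors to 2-D for plotting.
embedded = TSNE(n_components=2, perplexity=10.0,
                random_state=0).fit_transform(latents)
print(embedded.shape)  # -> (120, 2)
```

Each 2-D point would then be scatter-plotted with a color given by `policy_ids`, so that well-disentangled latents appear as separated clusters per policy.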


Overview of latent-model based RL methods. In SLAC, latent representations are used to improve the sample efficiency of model-free RL training algorithms, by jointly modeling and learning dynamics and controls over the latent space. Similarly, SOLAR improves data efficiency for multi-task RL by first learning high-level latent representations of the environment, which can be shared across different tasks; then, local dynamics models are inferred from the abstraction, with controls solved by linear-quadratic regulators. PlaNet and Dreamer further improve the architecture and training objectives of latent models, allowing them to look ahead multiple steps and plan over longer horizons. There also exists LatCo, which directly performs trajectory optimization over the latent space, allowing the agent to temporarily bypass dynamical constraints and quickly navigate to high-reward regions in the early training stage. To summarize, the methods above leverage latent representations to gain sufficient exploration coverage and quickly navigate to high-reward regions, improving sample efficiency for policy optimization. Note that they mostly require online interactions with the environment to populate a growing experience replay buffer for policy learning, and thus have different goals than OPE, which requires learning from a fixed set of offline trajectories.

Regret@1 for all Adroit tasks. Results are obtained by averaging over 3 random seeds used for training, with standard deviations shown in subscripts.

MAE between estimated and ground-truth returns for all Gym-Mujoco tasks. Results are obtained by averaging over 3 random seeds used for training, with standard deviations shown in subscripts.

Summary of the Gym-Mujoco environments and datasets used to train VLBM and baselines.

Summary of the Adroit environments and datasets used to train VLBM and baselines.

ACKNOWLEDGMENTS

This work is sponsored in part by the AFOSR under award number FA9550-19-1-0169, and by the NSF CNS-1652544, CNS-1837499, DUE-1726550, IIS-1651909 and DUE-2013502 awards, as well as the National AI Institute for Edge Computing Leveraging Next Generation Wireless Networks, Grant CNS-2112562.

F BASICS OF VARIATIONAL INFERENCE

Classic variational auto-encoders (VAEs) are designed to generate synthetic data that share similar characteristics with the data used for training (Kingma & Welling, 2013). Specifically, VAEs learn an approximated posterior q_ψ(z|x) and a generative model p_ϕ(x|z) over the prior p(z), with x being the data and z the latent variable. The true posterior p_ϕ(z|x) is intractable, i.e., p_ϕ(z|x) = p_ϕ(x|z)p(z) / p_ϕ(x), since the marginal likelihood in the denominator, p_ϕ(x) = ∫_z p_ϕ(x|z)p(z)dz, requires integration over the unknown latent space. For the same reason, VAEs cannot be trained to directly maximize the marginal log-likelihood, max log p_ϕ(x). To resolve this, one can instead maximize a lower bound of log p_ϕ(x), i.e., max_{ψ,ϕ} E_{q_ψ(z|x)}[log p_ϕ(x|z)] − KL(q_ψ(z|x) || p(z)), which is the evidence lower bound (ELBO).
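The bound can be verified numerically on a toy conjugate model where the evidence is available in closed form: take p(z) = N(0, 1) and p(x|z) = N(z, 1), so p(x) = N(0, 2) and the true posterior is N(x/2, 1/2). For a Gaussian q(z|x) = N(m, s²) the ELBO is computable in closed form, is tight at the true posterior, and never exceeds log p(x); the model and numbers below are purely illustrative:

```python
import numpy as np

def elbo(x, m, s2):
    """ELBO = E_q[log p(x|z)] - KL(q || p(z)),
    for p(z) = N(0,1), p(x|z) = N(z,1), q(z|x) = N(m, s2)."""
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl = -0.5 * np.log(s2) + 0.5 * (s2 + m ** 2) - 0.5
    return recon - kl

def log_evidence(x):
    """log p(x) with p(x) = N(0, 2)."""
    return -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0

x = 1.3
# The bound is tight at the true posterior q = N(x/2, 1/2) ...
assert abs(elbo(x, x / 2, 0.5) - log_evidence(x)) < 1e-9
# ... and strictly below log p(x) for any other q, e.g. q = N(0, 1).
assert elbo(x, 0.0, 1.0) < log_evidence(x)
```

Maximizing the ELBO over (m, s²) thus recovers the true posterior in this conjugate case, which is exactly the behavior VAE training approximates with neural networks.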

Reparameterization.

During training, it is required to constantly sample from q_ψ(z|x) and p_ϕ(x|z). The reparameterization technique is introduced in (Kingma & Welling, 2013) to ensure that gradients can flow through such sampling processes during back-propagation. For example, suppose both distributions q_ψ(z|x) and p_ϕ(x|z) follow diagonal Gaussians, with means and diagonal covariances determined by MLPs, i.e., q_ψ(z|x) = N(μ_ψ(x), Σ_ψ(x)) and p_ϕ(x|z) = N(μ_ϕ(z), Σ_ϕ(z)), where μ_ψ, Σ_ψ, μ_ϕ, Σ_ϕ are the MLPs that generate the means and covariances. The sampling processes above can then be captured by reparameterization, e.g., z = μ_ψ(x) + Σ_ψ(x)^{1/2} ⊙ ϵ, with ϵ ∼ N(0, I). Consequently, the gradients over ψ and ϕ can be calculated following the chain rule, and used for back-propagation during training. We direct readers to (Kingma & Welling, 2013) for a comprehensive review of reparameterization.

Published as a conference paper at ICLR 2023
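A minimal sketch of the reparameterized sampling step, showing that gradients reach the parameters that produced the mean and (log-)variance; PyTorch is used here for autograd, and the tensors stand in for the MLP outputs rather than the paper's networks:

```python
import torch

# Stand-ins for the MLP outputs mu_psi(x) and (the log of) Sigma_psi(x).
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, 0.2], requires_grad=True)

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
eps = torch.randn(2)
z = mu + torch.exp(0.5 * log_var) * eps

# Any downstream loss now back-propagates through the sampling step.
loss = (z ** 2).sum()
loss.backward()
assert mu.grad is not None and log_var.grad is not None
```

Sampling z directly from a distribution object would block the gradient; rewriting the sample as a deterministic function of (μ, Σ) plus exogenous noise ϵ is what makes the chain rule applicable.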

