WASSERSTEIN AUTO-ENCODED MDPS: FORMAL VERIFICATION OF EFFICIENTLY DISTILLED RL POLICIES WITH MANY-SIDED GUARANTEES

Abstract

Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov decision processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments from a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.

1. INTRODUCTION

Reinforcement learning (RL) is emerging as a solution of choice to address challenging real-world scenarios such as epidemic mitigation and prevention strategies (Libin et al., 2020), multi-energy management (Ceusters et al., 2021), or effective canal control (Ren et al., 2021). RL enables learning high-performance controllers by introducing general nonlinear function approximators (such as neural networks) to scale with high-dimensional and continuous state-action spaces. This introduction, termed deep-RL, causes the loss of the conventional convergence guarantees of RL (Tsitsiklis, 1994) as well as those obtained in some continuous settings (Nowe, 1994), and hinders their wide roll-out in critical settings. This work enables the formal verification of any such policies, learned by agents interacting with unknown, continuous environments modeled as Markov decision processes (MDPs). Specifically, we learn a discrete representation of the state-action space of the MDP, which yields both a (smaller, explicit) latent space model and a distilled version of the RL policy, both tractable for model checking (Baier & Katoen, 2008). These are supported by bisimulation guarantees: intuitively, the agent behaves similarly in the original and latent models. The strength of our approach is not simply that we verify that the RL agent meets a predefined set of specifications, but rather that we provide an abstract model on which the user can reason and check any desired agent property. Variational MDPs (VAE-MDPs, Delgrange et al. 2022) offer a valuable framework for doing so. The distillation is provided with PAC-verifiable bisimulation bounds guaranteeing that the agent behaves similarly (i) in the original and latent models (abstraction quality); (ii) from all original states embedded to the same discrete state (representation quality).
Whilst the bounds offer a confidence metric that enables the verification of performance and safety properties, VAE-MDPs suffer from several learning flaws. First, training a VAE-MDP relies on variational proxies to the bisimulation bounds, meaning there is no learning guarantee on the quality of the latent model via its optimization. Second, variational autoencoders (VAEs) (Kingma & Welling, 2014; Hoffman et al., 2013) are known to suffer from posterior collapse (e.g., Alemi et al. 2018), resulting in a deterministic mapping to a unique latent state in VAE-MDPs. Most of the training process focuses on handling this phenomenon and setting the stage for the concrete distillation and abstraction, which finally take place in a second training phase. This requires extra regularizers, setting up annealing schemes and learning phases, and defining prioritized replay buffers to store transitions. Distillation through VAE-MDPs is thus a meticulous task, requiring a large step budget and the tuning of many hyperparameters. Building upon Wasserstein autoencoders (Tolstikhin et al., 2018) instead of VAEs, we introduce Wasserstein auto-encoded MDPs (WAE-MDPs), which overcome those limitations. Our WAE relies on the optimal transport (OT) from trace distributions resulting from the execution of the RL policy in the real environment to those reconstructed from the latent model operating under the distilled policy. In contrast to VAEs, which rely on variational proxies, we derive a novel objective that directly incorporates the bisimulation bounds. Furthermore, while VAEs learn stochastic mappings to the latent space which need to be determinized, or even entirely reconstructed from data at deployment time, to obtain the guarantees, our WAE has no such requirement: it learns all the components necessary to obtain the guarantees during training and needs no post-processing operations.
Those theoretical claims are reflected in our experiments: policies are distilled up to 10 times faster through WAE- than VAE-MDPs and provide better abstraction quality and performance in general, without the need for setting up annealing schemes and training phases, nor prioritized buffers and extra regularizers. Our distilled policies are able to recover (and sometimes even outperform) the original policy performance, highlighting the representation quality offered by our new framework: the distillation is able to remove some non-robustness of the input RL policy. Finally, we formally verified time-to-failure properties (e.g., Pnueli 1977) to emphasize the applicability of our approach. Other Related Work. Complementary works approach safe RL via formal methods (Junges et al., 2016; Alshiekh et al., 2018; Jansen et al., 2020; Simão et al., 2021), aimed at formally ensuring safety during RL; all of these require providing an abstract model of the safety aspects of the environment. They also include the work of Alamdari et al. (2020), applying synthesis and model checking on policies distilled from RL, without quality guarantees. Other frameworks share our goal of verifying deep-RL policies (Bacci & Parker, 2020; Carr et al., 2020) but rely on a known environment model, among other assumptions (e.g., a deterministic or discrete environment). Finally, DeepSynth (Hasanbeig et al., 2021) allows learning a formal model from execution traces, with the different purpose of guiding the agent towards sparse and non-Markovian rewards. On the latent space training side, WWAEs (Zhang et al., 2019) reuse OT as the latent regularizer discrepancy (in Gaussian closed form), whereas we derive two regularizers involving OT. These two are, in contrast, optimized via the dual formulation of Wasserstein, as in Wasserstein-GANs (Arjovsky et al., 2017).
Similarly to VQ-VAEs (van den Oord et al., 2017) and Latent Bernoulli AEs (Fajtl et al., 2020), our latent space model learns discrete spaces via deterministic encoders, but relies on a smooth approximation instead of the straight-through gradient estimator. Works on representation learning for RL (Gelada et al., 2019; Castro et al., 2021; Zhang et al., 2021; Zang et al., 2022) consider bisimulation metrics to optimize the representation quality, aiming to learn (continuous) representations that capture bisimulation, so that two states close in the representation are guaranteed to provide close and relevant information for optimizing the performance of the controller. In particular, as in our work, DeepMDPs (Gelada et al., 2019) are learned by optimizing local losses, but under the assumption of a deterministic MDP and without a verifiable confidence measure.

2. BACKGROUND

In the following, we write $\Delta(\mathcal{X})$ for the set of measures over a (complete, separable metric space) $\mathcal{X}$. Markov decision processes (MDPs) are tuples $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathbf{P}, \mathcal{R}, \ell, \mathrm{AP}, s_I \rangle$ where $\mathcal{S}$ is a set of states; $\mathcal{A}$, a set of actions; $\mathbf{P}\colon \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, a probability transition function that maps the current state and action to a distribution over the next states; $\mathcal{R}\colon \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a reward function; $\ell\colon \mathcal{S} \to 2^{\mathrm{AP}}$, a labeling function over a set of atomic propositions $\mathrm{AP}$; and $s_I \in \mathcal{S}$, the initial state. If $|\mathcal{A}| = 1$, $\mathcal{M}$ is a fully stochastic process called a Markov chain (MC). We write $\mathcal{M}_s$ for the MDP obtained by replacing the initial state of $\mathcal{M}$ with $s \in \mathcal{S}$. An agent interacting in $\mathcal{M}$ produces trajectories, i.e., sequences of states and actions $\tau = \langle s_{0:T}, a_{0:T-1} \rangle$ where $s_0 = s_I$ and $s_{t+1} \sim \mathbf{P}(\cdot \mid s_t, a_t)$ for $t < T$. The set of infinite trajectories of $\mathcal{M}$ is $\mathrm{Traj}$. We assume $\mathrm{AP}$ and labels are respectively one-hot and binary encoded. Given $T \subseteq \mathrm{AP}$, we write $s \models T$ if $s$ is labeled with $T$, i.e., $\ell(s) \cap T \neq \emptyset$, and $s \models \neg T$ if $s \not\models T$. We refer to MDPs with continuous state or action spaces as continuous MDPs. In that case, we assume $\mathcal{S}$ and $\mathcal{A}$ are complete separable metric spaces equipped with a Borel $\sigma$-algebra, and $\ell^{-1}(T)$ is Borel-measurable for any $T \subseteq \mathrm{AP}$. Policies and stationary distributions. A (memoryless) policy $\pi\colon \mathcal{S} \to \Delta(\mathcal{A})$ prescribes which action to choose at each step of the interaction. The set of memoryless policies of $\mathcal{M}$ is $\Pi$. The MDP $\mathcal{M}$ and $\pi \in \Pi$ induce an MC $\mathcal{M}_\pi$ with a unique probability measure $\mathbb{P}^{\mathcal{M}_\pi}$ on the Borel $\sigma$-algebra over measurable subsets $\varphi \subseteq \mathrm{Traj}$ (Puterman, 1994). We drop the superscript when the context is clear. Define $\xi^t_\pi(s' \mid s) = \mathbb{P}^{\mathcal{M}_s, \pi}(\{ s_{0:\infty}, a_{0:\infty} \mid s_t = s' \})$ as the distribution giving the probability of being in each state of $\mathcal{M}_s$ after $t$ steps. $B \subseteq \mathcal{S}$ is a bottom strongly connected component (BSCC) of $\mathcal{M}_\pi$ if (i) $B$ is a maximal subset satisfying $\xi^t_\pi(s' \mid s) > 0$ for any $s, s' \in B$ and some $t \geq 0$, and (ii) $\mathbb{E}_{a \sim \pi(\cdot \mid s)}\, \mathbf{P}(B \mid s, a) = 1$ for all $s \in B$.
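The finite-state case of these definitions can be made concrete in a few lines of code. The following is a minimal sketch (all states, actions, probabilities, and labels are made up for illustration):

```python
import random

# Finite MDP M = <S, A, P, R, l, AP, s_I>; every number here is illustrative.
S = [0, 1, 2]
A = ["a", "b"]
P = {  # P(s, a) -> distribution over next states, as {s': prob}
    (0, "a"): {0: 0.5, 1: 0.5}, (0, "b"): {2: 1.0},
    (1, "a"): {1: 1.0},         (1, "b"): {0: 0.3, 2: 0.7},
    (2, "a"): {2: 1.0},         (2, "b"): {2: 1.0},
}
R = {(s, a): float(s == 1) for s in S for a in A}  # reward 1 in state 1
label = {0: set(), 1: {"goal"}, 2: {"unsafe"}}     # l : S -> 2^AP

def models(s, T):
    """s |= T iff l(s) and T intersect."""
    return bool(label[s] & T)

def step(s, a, rng):
    """Sample s' ~ P(. | s, a)."""
    nxt = P[(s, a)]
    return rng.choices(list(nxt), weights=list(nxt.values()))[0]

def trajectory(pi, s0, T, rng):
    """Run a memoryless policy pi : S -> distribution over A for T steps,
    producing tau = <s_{0:T}, a_{0:T-1}> as in the text."""
    states, actions = [s0], []
    for _ in range(T):
        dist = pi(states[-1])
        a = rng.choices(list(dist), weights=list(dist.values()))[0]
        actions.append(a)
        states.append(step(states[-1], a, rng))
    return states, actions

rng = random.Random(0)
states, actions = trajectory(lambda s: {"a": 0.5, "b": 0.5}, 0, 10, rng)
```

In the continuous setting considered in the paper, `P` and `pi` would of course be densities rather than dictionaries; the sketch only fixes the vocabulary used throughout the section.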
The unique stationary distribution of $B$ is $\xi_\pi \in \Delta(B)$. We write $s, a \sim \xi_\pi$ for sampling $s$ from $\xi_\pi$ and then $a$ from $\pi(\cdot \mid s)$. An MDP $\mathcal{M}$ is ergodic if, for all $\pi \in \Pi$, the state space of $\mathcal{M}_\pi$ consists of a unique aperiodic BSCC with $\xi_\pi = \lim_{t \to \infty} \xi^t_\pi(\cdot \mid s)$ for all $s \in \mathcal{S}$. Value objectives. Given $\pi \in \Pi$, the value of a state $s \in \mathcal{S}$ is the expected value of a random variable obtained by running $\pi$ from $s$. For a discount factor $\gamma \in [0, 1]$, we consider the following objectives. (i) Discounted return: we write $V_\pi(s) = \mathbb{E}_{\mathcal{M}_s, \pi}\left[ \sum_{t=0}^{\infty} \gamma^t\, \mathcal{R}(s_t, a_t) \right]$ for the expected discounted rewards accumulated along trajectories. The typical goal of an RL agent is to learn a policy $\pi^\star$ that maximizes $V_{\pi^\star}(s_I)$ through interactions with the (unknown) MDP. (ii) Reachability: let $C, T \subseteq \mathrm{AP}$; the (constrained) reachability event is $C \,\mathsf{U}\, T = \{ s_{0:\infty}, a_{0:\infty} \mid \exists i \in \mathbb{N},\ \forall j < i,\ s_j \models C \wedge s_i \models T \} \subseteq \mathrm{Traj}$. We write $V^\varphi_\pi(s) = \mathbb{E}_{\mathcal{M}_s, \pi}\left[ \gamma^{t^\star} \mathbf{1}_{\langle s_{0:\infty}, a_{0:\infty} \rangle \in \varphi} \right]$ for the discounted probability of satisfying $\varphi = C \,\mathsf{U}\, T$, where $t^\star$ is the length of the shortest trajectory prefix that allows satisfying $\varphi$. Intuitively, this denotes the discounted return of remaining in a region of the MDP where states are labeled with $C$ until visiting for the first time a goal state labeled with $T$, where the return is the binary reward signal capturing this event. Safety w.r.t. failure states $C$ can be expressed as the safety-constrained reachability to a destination $T$ through $\neg C \,\mathsf{U}\, T$. Notice that $V^\varphi_\pi(s) = \mathbb{P}_{\mathcal{M}_s, \pi}(\varphi)$ when $\gamma = 1$. Latent MDP. Given the original (continuous, possibly unknown) environment model $\mathcal{M}$, a latent space model is another (smaller, explicit) MDP $\overline{\mathcal{M}} = \langle \overline{\mathcal{S}}, \overline{\mathcal{A}}, \overline{\mathbf{P}}, \overline{\mathcal{R}}, \overline{\ell}, \mathrm{AP}, \overline{s}_I \rangle$ whose state-action space is linked to the original one via state and action embedding functions: $\phi\colon \mathcal{S} \to \overline{\mathcal{S}}$ and $\psi\colon \overline{\mathcal{S}} \times \overline{\mathcal{A}} \to \mathcal{A}$. We refer to $\langle \overline{\mathcal{M}}, \phi, \psi \rangle$ as a latent space model of $\mathcal{M}$ and to $\overline{\mathcal{M}}$ as its latent MDP. Our goal is to learn $\langle \overline{\mathcal{M}}, \phi, \psi \rangle$ by optimizing an equivalence criterion between the two models.
We assume that $d_{\overline{\mathcal{S}}}$ is a metric on $\overline{\mathcal{S}}$, and write $\overline{\Pi}$ for the set of policies of $\overline{\mathcal{M}}$ and $\overline{V}_{\overline{\pi}}$ for the values of running $\overline{\pi} \in \overline{\Pi}$ in $\overline{\mathcal{M}}$. Remark 1 (Latent flow). The latent policy $\overline{\pi}$ can be seen as a policy in $\mathcal{M}$ (cf. Fig. 1a): states passed to $\overline{\pi}$ are first embedded into the latent space with $\phi$, then the actions produced by $\overline{\pi}$ are executed via $\psi$ in the original environment. For $s \in \mathcal{S}$, we write $\overline{a} \sim \overline{\pi}(\cdot \mid s)$ for $\overline{\pi}(\cdot \mid \phi(s))$; the reward and next state are then respectively given by $\mathcal{R}(s, \overline{a}) = \mathcal{R}(s, \psi(\phi(s), \overline{a}))$ and $s' \sim \mathbf{P}(\cdot \mid s, \overline{a}) = \mathbf{P}(\cdot \mid s, \psi(\phi(s), \overline{a}))$. Local losses allow quantifying the distance between the original and latent reward/transition functions in the local setting, i.e., under a given state-action distribution $\xi \in \Delta(\mathcal{S} \times \overline{\mathcal{A}})$: $L^\xi_{\mathcal{R}} = \mathbb{E}_{s, \overline{a} \sim \xi} \left| \mathcal{R}(s, \overline{a}) - \overline{\mathcal{R}}(\phi(s), \overline{a}) \right|$ and $L^\xi_{\mathbf{P}} = \mathbb{E}_{s, \overline{a} \sim \xi}\, D\big( \phi\mathbf{P}(\cdot \mid s, \overline{a}),\ \overline{\mathbf{P}}(\cdot \mid \phi(s), \overline{a}) \big)$, where $\phi\mathbf{P}(\cdot \mid s, \overline{a})$ is the distribution of drawing $s' \sim \mathbf{P}(\cdot \mid s, \overline{a})$ and then embedding $\overline{s}' = \phi(s')$, and $D$ is a discrepancy measure. Fig. 1a depicts the losses when states and actions are drawn from the stationary distribution $\xi_{\overline{\pi}}$ resulting from running $\overline{\pi} \in \overline{\Pi}$ in $\mathcal{M}$. In this work, we focus on the case where $D$ is the Wasserstein distance $W_{d_{\overline{\mathcal{S}}}}$: given two distributions $P, Q$ over a measurable set $\mathcal{X}$ equipped with a metric $d$, $W_d$ is the solution of the optimal transport (OT) problem from $P$ to $Q$, i.e., the minimum cost of changing $P$ into $Q$ (Villani, 2009): $W_d(P, Q) = \inf_{\lambda \in \Lambda(P, Q)} \mathbb{E}_{x, y \sim \lambda}\, d(x, y)$, where $\Lambda(P, Q)$ is the set of all couplings of $P$ and $Q$. The Kantorovich duality yields $W_d(P, Q) = \sup_{f \in \mathcal{F}_d} \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y)$, where $\mathcal{F}_d$ is the set of 1-Lipschitz functions. Local losses are related to a well-established behavioral equivalence between transition systems, called bisimulation. Bisimulation. A bisimulation $B$ on $\mathcal{M}$ is a behavioral equivalence between states $s_1, s_2 \in \mathcal{S}$ such that $s_1 \mathrel{B} s_2$ iff (i) $\mathbf{P}(T \mid s_1, a) = \mathbf{P}(T \mid s_2, a)$, (ii) $\ell(s_1) = \ell(s_2)$, and (iii) $\mathcal{R}(s_1, a) = \mathcal{R}(s_2, a)$, for each action $a \in \mathcal{A}$ and (Borel-measurable) equivalence class $T \in \mathcal{S}/B$.
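The two formulations of $W_d$ (primal coupling and Kantorovich dual) can be illustrated on the real line, where the optimal coupling between two equal-size empirical distributions simply matches sorted samples. A minimal sketch with made-up samples:

```python
# 1-D optimal transport with d(x, y) = |x - y|: for equal-size empirical
# distributions, sorting gives the optimal coupling (a standard 1-D fact,
# shown here only to illustrate the definitions; samples are made up).
def wasserstein_1d(xs, ys):
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

P_samples = [0.0, 0.0, 1.0, 1.0]   # empirical P: uniform on {0, 1}
Q_samples = [1.0, 1.0, 2.0, 2.0]   # empirical Q: uniform on {1, 2}
primal = wasserstein_1d(P_samples, Q_samples)  # transport everything by 1

# A 1-Lipschitz witness f(x) = -x attains the Kantorovich dual here:
f = lambda x: -x
dual = (sum(map(f, P_samples)) / len(P_samples)
        - sum(map(f, Q_samples)) / len(Q_samples))

assert primal == 1.0
assert abs(dual - primal) < 1e-12  # duality: sup over 1-Lipschitz f matches
```

In the paper's setting the distributions live over latent transitions rather than reals, so the supremum over 1-Lipschitz functions is instead approximated by constrained networks (Section 3.1).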
Properties of bisimulation include trajectory and value equivalence (Larsen & Skou, 1989; Givan et al., 2003). Requirements (ii) and (iii) can be respectively relaxed depending on whether we focus only on behaviors formalized through $\mathrm{AP}$ or on rewards. The relation can be extended to compare two MDPs (e.g., $\mathcal{M}$ and $\overline{\mathcal{M}}$) by considering the disjoint union of their state spaces. We denote the largest bisimulation relation by $\sim$. Characterized by a logical family of functional expressions derived from a logic $\mathcal{L}$, bisimulation pseudometrics (Desharnais et al., 2004) generalize the notion of bisimilarity. More specifically, given a policy $\pi \in \Pi$, we consider a family $\mathcal{F}$ of real-valued functions parameterized by a discount factor $\gamma$ and defining the semantics of $\mathcal{L}$ in $\mathcal{M}_\pi$. Such functional expressions allow formalizing discounted properties such as reachability and safety, as well as general $\omega$-regular specifications (Chatterjee et al., 2010), and may include rewards as well (Ferns et al., 2014). The pseudometric $d^{\sim}_\pi$ is defined as the largest behavioral difference $d^{\sim}_\pi(s_1, s_2) = \sup_{f \in \mathcal{F}} |f(s_1) - f(s_2)|$, and its kernel is bisimilarity: $d^{\sim}_\pi(s_1, s_2) = 0$ iff $s_1 \sim s_2$. In particular, value functions are Lipschitz-continuous w.r.t. $d^{\sim}_\pi$: $|V_\pi(s_1) - V_\pi(s_2)| \leq K\, d^{\sim}_\pi(s_1, s_2)$, where $K = \frac{1}{1 - \gamma}$ if rewards are included in $\mathcal{F}$ and $K = 1$ otherwise. To ensure the upcoming bisimulation guarantees, we make the following assumptions: Assumption 2.1. The MDP $\mathcal{M}$ is ergodic, $\mathrm{Im}(\mathcal{R})$ is a bounded space scaled to $[-\frac{1}{2}, \frac{1}{2}]$, and the embedding function preserves the labels, i.e., $\phi(s) = \overline{s} \implies \ell(s) = \overline{\ell}(\overline{s})$ for $s \in \mathcal{S}$, $\overline{s} \in \overline{\mathcal{S}}$. Note that the ergodicity assumption is compliant with episodic RL and a wide range of continuous learning tasks (see Huang 2020; Delgrange et al. 2022 for detailed discussions of this setting). Bisimulation bounds (Delgrange et al., 2022).
$\mathcal{M}$ being set over continuous spaces with possibly unknown dynamics, evaluating $d^{\sim}$ can turn out to be particularly arduous, if not intractable. A solution is to evaluate the bisimilarity of the original and latent models via local losses: fix $\overline{\pi} \in \overline{\Pi}$, assume $\overline{\mathcal{M}}$ is discrete; then, given the stationary distribution $\xi_{\overline{\pi}}$ induced in $\mathcal{M}$, and $s_1, s_2 \in \mathcal{S}$ with $\phi(s_1) = \phi(s_2)$:

$$\mathbb{E}_{s \sim \xi_{\overline{\pi}}}\ d^{\sim}_{\overline{\pi}}(s, \phi(s)) \leq \frac{L^{\xi_{\overline{\pi}}}_{\mathcal{R}} + \gamma L^{\xi_{\overline{\pi}}}_{\mathbf{P}}}{1 - \gamma}, \qquad d^{\sim}_{\overline{\pi}}(s_1, s_2) \leq \left( \frac{L^{\xi_{\overline{\pi}}}_{\mathcal{R}} + \gamma L^{\xi_{\overline{\pi}}}_{\mathbf{P}}}{1 - \gamma} \right) \left( \xi_{\overline{\pi}}^{-1}(s_1) + \xi_{\overline{\pi}}^{-1}(s_2) \right). \quad (1)$$

The two inequalities guarantee respectively the quality of the abstraction and of the representation: when local losses are small, (i) states and their embedding are bisimilarly close on average, and (ii) all states sharing the same discrete representation are bisimilarly close. The local losses and related bounds can be efficiently PAC-estimated. Our goal is to learn a latent model where the behaviors of the agent executing $\overline{\pi}$ can be formally verified, and the bounds offer a confidence metric allowing to lift the guarantees obtained this way back to the original model $\mathcal{M}$, when the latter operates under $\overline{\pi}$. We show in the following how to learn a latent space model by optimizing the aforementioned bounds, and how to distill policies $\pi \in \Pi$ obtained via any RL technique into a latent policy $\overline{\pi} \in \overline{\Pi}$.
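The arithmetic of Eq. (1) is straightforward once the local losses have been PAC-estimated. The following sketch plugs in hypothetical values (all numbers are made up for illustration):

```python
# Eq. (1) with hypothetical PAC-estimated local losses.
gamma = 0.99
L_R, L_P = 0.01, 0.05  # estimated reward / transition local losses

# Abstraction bound: expected bisimulation distance between states and
# their embeddings.
abstraction_bound = (L_R + gamma * L_P) / (1 - gamma)

# Representation bound for two states sharing the same embedding, with
# (hypothetical) stationary masses xi(s1), xi(s2):
xi_s1, xi_s2 = 0.2, 0.1
representation_bound = abstraction_bound * (1 / xi_s1 + 1 / xi_s2)
```

Note how the $\frac{1}{1-\gamma}$ factor makes both bounds loose for $\gamma$ close to 1 unless the local losses are driven very small, which is precisely what the objective of Section 3 targets.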

3. WASSERSTEIN AUTO-ENCODED MDPS

Fix $\overline{\mathcal{M}}_\theta = \langle \overline{\mathcal{S}}, \overline{\mathcal{A}}, \overline{\mathbf{P}}_\theta, \overline{\mathcal{R}}_\theta, \overline{\ell}, \mathrm{AP}, \overline{s}_I \rangle$ and let $\langle \overline{\mathcal{M}}_\theta, \phi_\iota, \psi_\theta \rangle$ be a latent space model of $\mathcal{M}$, parameterized by $\iota$ and $\theta$. Our method relies on learning a behavioral model $\xi_\theta$ of $\mathcal{M}$ from which we can retrieve the latent space model and distill $\pi$. This can be achieved via the minimization of a suitable discrepancy between $\xi_\theta$ and $\mathcal{M}_\pi$. VAE-MDPs optimize a lower bound on the likelihood of the dynamics of $\mathcal{M}_\pi$ using the Kullback-Leibler divergence, yielding (i) $\overline{\mathcal{M}}_\theta$, (ii) a distillation $\overline{\pi}_\theta$ of $\pi$, and (iii) $\phi_\iota$ and $\psi_\theta$. Local losses are not directly minimized, but rather variational proxies that offer no theoretical guarantees during the learning process. To control the minimization of the local losses and exploit their theoretical guarantees, we present a novel autoencoder, derived from the OT, that incorporates them in its objective. Proofs of the claims made in this section are provided in Appendix A.

3.1. THE OBJECTIVE FUNCTION

Assume that $\mathcal{S}$, $\mathcal{A}$, and $\mathrm{Im}(\mathcal{R})$ are respectively equipped with metrics $d_{\mathcal{S}}$, $d_{\mathcal{A}}$, and $d_{\mathcal{R}}$. We define the raw transition distance metric $\vec{d}$ as the component-wise sum of distances between the states, actions, and rewards occurring along transitions: $\vec{d}(\langle s_1, a_1, r_1, s'_1 \rangle, \langle s_2, a_2, r_2, s'_2 \rangle) = d_{\mathcal{S}}(s_1, s_2) + d_{\mathcal{A}}(a_1, a_2) + d_{\mathcal{R}}(r_1, r_2) + d_{\mathcal{S}}(s'_1, s'_2)$. Given Assumption 2.1, we consider the OT between local distributions, where traces are drawn from episodic RL processes or infinite interactions (we show in Appendix A.1 that considering the OT between trace-based distributions amounts, in the limit, to reasoning about stationary distributions). Our goal is to minimize $W_{\vec{d}}(\xi_\pi, \xi_\theta)$, where

$$\xi_\theta(s, a, r, s') = \int_{\overline{\mathcal{S}} \times \overline{\mathcal{A}} \times \overline{\mathcal{S}}} \overline{P}_\theta(s, a, r, s' \mid \overline{s}, \overline{a}, \overline{s}')\ d\,\overline{\xi}^{\,\overline{\pi}}_\theta(\overline{s}, \overline{a}, \overline{s}'), \quad (2)$$

$\overline{P}_\theta$ being a transition decoder and $\overline{\xi}^{\,\overline{\pi}}_\theta$ denoting the stationary distribution of the latent model $\overline{\mathcal{M}}_\theta$. As proved by Bousquet et al. (2017), this model allows deriving a simpler form of the OT: instead of finding the optimal coupling of (i) the stationary distribution $\xi_\pi$ of $\mathcal{M}_\pi$ and (ii) the behavioral model $\xi_\theta$ in the primal definition of $W_{\vec{d}}(\xi_\pi, \xi_\theta)$, it is sufficient to find an encoder $q$ whose marginal, given by $Q(\overline{s}, \overline{a}, \overline{s}') = \mathbb{E}_{s, a, s' \sim \xi_\pi}\ q(\overline{s}, \overline{a}, \overline{s}' \mid s, a, s')$, is identical to $\overline{\xi}^{\,\overline{\pi}}_\theta$. This is summarized in the following theorem, yielding a particular case of Wasserstein autoencoder (Tolstikhin et al., 2018):

Theorem 3.1. Let $\xi_\theta$ and $\overline{P}_\theta$ be respectively a behavioral model and a transition decoder as defined in Eq. 2, $G_\theta\colon \overline{\mathcal{S}} \to \mathcal{S}$ be a state-wise decoder, and $\psi_\theta$ be an action embedding function. Assume $\overline{P}_\theta$ is deterministic, with Dirac function $\overline{G}_\theta(\overline{s}, \overline{a}, \overline{s}') = \langle G_\theta(\overline{s}), \psi_\theta(\overline{s}, \overline{a}), \overline{\mathcal{R}}_\theta(\overline{s}, \overline{a}), G_\theta(\overline{s}') \rangle$; then

$$W_{\vec{d}}(\xi_\pi, \xi_\theta) = \inf_{q\colon Q = \overline{\xi}^{\,\overline{\pi}}_\theta}\ \mathbb{E}_{s, a, r, s' \sim \xi_\pi}\ \mathbb{E}_{\overline{s}, \overline{a}, \overline{s}' \sim q(\cdot \mid s, a, s')}\ \vec{d}\big( \langle s, a, r, s' \rangle,\ \overline{G}_\theta(\overline{s}, \overline{a}, \overline{s}') \big).$$
Henceforth, fix $\phi_\iota\colon \mathcal{S} \to \overline{\mathcal{S}}$ and $\phi^{\mathcal{A}}_\iota\colon \overline{\mathcal{S}} \times \mathcal{A} \to \Delta(\overline{\mathcal{A}})$ as parameterized state and action encoders with $\phi_\iota(\overline{s}, \overline{a}, \overline{s}' \mid s, a, s') = \mathbf{1}_{\phi_\iota(s) = \overline{s}} \cdot \phi^{\mathcal{A}}_\iota(\overline{a} \mid \overline{s}, a) \cdot \mathbf{1}_{\phi_\iota(s') = \overline{s}'}$, and define the marginal encoder as $Q_\iota = \mathbb{E}_{s, a, s' \sim \xi_\pi}\ \phi_\iota(\cdot \mid s, a, s')$. Training the model components can be achieved via the objective

$$\min_{\iota, \theta}\ \mathbb{E}_{s, a, r, s' \sim \xi_\pi}\ \mathbb{E}_{\overline{s}, \overline{a}, \overline{s}' \sim \phi_\iota(\cdot \mid s, a, s')}\ \vec{d}\big( \langle s, a, r, s' \rangle,\ \overline{G}_\theta(\overline{s}, \overline{a}, \overline{s}') \big) + \beta \cdot D\big( Q_\iota,\ \overline{\xi}^{\,\overline{\pi}}_\theta \big),$$

where $D$ is an arbitrary discrepancy metric and $\beta > 0$ a hyperparameter. Intuitively, the encoder $\phi_\iota$ can be learned by enforcing its marginal distribution $Q_\iota$ to match $\overline{\xi}^{\,\overline{\pi}}_\theta$ through this discrepancy.

Remark 2. If $\mathcal{M}$ has a discrete action space, then learning $\overline{\mathcal{A}}$ is not necessary: we can set $\overline{\mathcal{A}} = \mathcal{A}$ and use identity functions for the action encoder and decoder (details in Appendix A.2).

When $\pi$ is executed in $\mathcal{M}$, observe that its parallel execution in $\overline{\mathcal{M}}_\theta$ is enabled by the action encoder $\phi^{\mathcal{A}}_\iota$: given an original state $s \in \mathcal{S}$, $\pi$ first prescribes the action $a \sim \pi(\cdot \mid s)$, which is then embedded into the latent space via $\overline{a} \sim \phi^{\mathcal{A}}_\iota(\cdot \mid \phi_\iota(s), a)$ (cf. Fig. 1b). This parallel execution, along with setting $D$ to $W_{\vec{d}}$, yields an upper bound on the latent regularization, compliant with the bisimulation bounds. A two-fold regularizer is obtained thereby, defining the foundations of our objective function:

Lemma 3.2. Define $\overline{T}(\overline{s}, \overline{a}, \overline{s}') = \mathbb{E}_{s, a \sim \xi_\pi} \big[ \mathbf{1}_{\phi_\iota(s) = \overline{s}} \cdot \phi^{\mathcal{A}}_\iota(\overline{a} \mid \overline{s}, a) \cdot \overline{\mathbf{P}}_\theta(\overline{s}' \mid \overline{s}, \overline{a}) \big]$ as the distribution of drawing state-action pairs from the interaction with $\mathcal{M}$, embedding them into the latent space, and finally letting them transition to their successor state in $\overline{\mathcal{M}}_\theta$. Then, $W_{\vec{d}}\big( Q_\iota, \overline{\xi}^{\,\overline{\pi}}_\theta \big) \leq W_{\vec{d}}\big( \overline{\xi}^{\,\overline{\pi}}_\theta, \overline{T} \big) + L^{\xi_\pi}_{\mathbf{P}}$.

We therefore define the W$^2$AE-MDP (Wasserstein-Wasserstein auto-encoded MDP) objective as

$$\min_{\iota, \theta}\ \mathbb{E}_{\substack{s, a, s' \sim \xi_\pi \\ \overline{s}, \overline{a}, \overline{s}' \sim \phi_\iota(\cdot \mid s, a, s')}} \big[ d_{\mathcal{S}}(s, G_\theta(\overline{s})) + d_{\mathcal{A}}(a, \psi_\theta(\overline{s}, \overline{a})) + d_{\mathcal{S}}(s', G_\theta(\overline{s}')) \big] + L^{\xi_\pi}_{\mathcal{R}} + \beta \cdot \big( W_{\xi_\pi} + L^{\xi_\pi}_{\mathbf{P}} \big),$$

where $W_{\xi_\pi} = W_{\vec{d}}\big( \overline{T}, \overline{\xi}^{\,\overline{\pi}}_\theta \big)$ and $L^{\xi_\pi}_{\mathbf{P}}$ are respectively called the steady-state and transition regularizers. The former allows quantifying the distance between the stationary distributions respectively induced by $\pi$ in $\mathcal{M}$ and $\overline{\pi}_\theta$ in $\overline{\mathcal{M}}_\theta$, further enabling the distillation. The latter allows learning the latent dynamics. Note that $L^{\xi_\pi}_{\mathcal{R}}$ and $L^{\xi_\pi}_{\mathbf{P}}$ (set over $\xi_\pi$ instead of $\xi_{\overline{\pi}_\theta}$) are not sufficient to ensure the bisimulation bounds (Eq. 1): running $\overline{\pi}$ in $\overline{\mathcal{M}}_\theta$ depends on the parallel execution of $\pi$ in the original model, which does not permit its (conventional) verification. Breaking this dependency is enabled by learning the distillation $\overline{\pi}_\theta$ through $W_{\xi_\pi}$, as shown in Fig. 1b: minimizing $W_{\xi_\pi}$ brings $\xi_\pi$ and $\overline{\xi}^{\,\overline{\pi}}_\theta$ closer together, further bridging the gap of the discrepancy between $\pi$ and $\overline{\pi}_\theta$. At any time, the local losses, along with the linked bisimulation bounds, can be recovered in the objective function of the W$^2$AE-MDP by considering the latent policy resulting from this distillation:

Theorem 3.3. Assume that traces are generated by running a latent policy $\overline{\pi} \in \overline{\Pi}$ in the original environment and let $d_{\mathcal{R}}$ be the usual Euclidean distance; then the W$^2$AE-MDP objective is

$$\min_{\iota, \theta}\ \mathbb{E}_{s, s' \sim \xi_{\overline{\pi}}} \big[ d_{\mathcal{S}}(s, G_\theta(\phi_\iota(s))) + d_{\mathcal{S}}(s', G_\theta(\phi_\iota(s'))) \big] + L^{\xi_{\overline{\pi}}}_{\mathcal{R}} + \beta \cdot \big( W_{\xi_{\overline{\pi}}} + L^{\xi_{\overline{\pi}}}_{\mathbf{P}} \big).$$

Optimizing the regularizers is enabled by the dual form of the OT: we introduce two parameterized networks, $\varphi^\xi_\omega$ and $\varphi^{\mathbf{P}}_\omega$, constrained to be 1-Lipschitz and trained to attain the supremum of the dual:

$$W_{\xi_\pi}(\omega) = \max_\omega\ \mathbb{E}_{s, a \sim \xi_\pi}\ \mathbb{E}_{\overline{a} \sim \phi^{\mathcal{A}}_\iota(\cdot \mid \phi_\iota(s), a)}\ \mathbb{E}_{\overline{s}^\star \sim \overline{\mathbf{P}}_\theta(\cdot \mid \phi_\iota(s), \overline{a})}\ \varphi^\xi_\omega(\phi_\iota(s), \overline{a}, \overline{s}^\star) - \mathbb{E}_{\overline{z}, \overline{a}', \overline{z}' \sim \overline{\xi}^{\,\overline{\pi}}_\theta}\ \varphi^\xi_\omega(\overline{z}, \overline{a}', \overline{z}'),$$

$$L^{\xi_\pi}_{\mathbf{P}}(\omega) = \max_\omega\ \mathbb{E}_{s, a, s' \sim \xi_\pi}\ \mathbb{E}_{\overline{s}, \overline{a}, \overline{s}' \sim \phi_\iota(\cdot \mid s, a, s')} \Big[ \varphi^{\mathbf{P}}_\omega(s, a, \overline{s}, \overline{a}, \overline{s}') - \mathbb{E}_{\overline{s}^\star \sim \overline{\mathbf{P}}_\theta(\cdot \mid \overline{s}, \overline{a})}\ \varphi^{\mathbf{P}}_\omega(s, a, \overline{s}, \overline{a}, \overline{s}^\star) \Big].$$

Details of the derivation of this tractable form of $L^{\xi_\pi}_{\mathbf{P}}(\omega)$ are in Appendix A.5. The networks are constrained via the gradient penalty approach of Gulrajani et al. (2017), leveraging the fact that any differentiable function is 1-Lipschitz iff its gradients have norm at most 1 everywhere (we show in Appendix A.6 that this remains valid for relaxations of discrete spaces). The final learning process is presented in Algorithm 1.

Algorithm 1: Wasserstein$^2$ Auto-Encoded MDP
Input: batch size $N$, max. steps $T$, number of regularizer updates $m$, penalty coefficient $\delta > 0$
for $t = 1$ to $T$ do
    for $i = 1$ to $N$ do
        Sample a transition $\langle s_i, a_i, r_i, s'_i \rangle$ from the original environment via $\xi_\pi$
        Embed the transition into the latent space by drawing $\overline{s}_i, \overline{a}_i, \overline{s}'_i$ from $\phi_\iota(\cdot \mid s_i, a_i, s'_i)$
        Make the latent space model transition to the next latent state: $\overline{s}^\star_i \sim \overline{\mathbf{P}}_\theta(\cdot \mid \overline{s}_i, \overline{a}_i)$
        Sample a latent transition from $\overline{\xi}^{\,\overline{\pi}}_\theta$: $\overline{z}_i \sim \overline{\xi}^{\,\overline{\pi}}_\theta$, $\overline{a}'_i \sim \overline{\pi}_\theta(\cdot \mid \overline{z}_i)$, and $\overline{z}'_i \sim \overline{\mathbf{P}}_\theta(\cdot \mid \overline{z}_i, \overline{a}'_i)$
    $\mathcal{W} \leftarrow \sum_{i=1}^{N} \varphi^\xi_\omega(\overline{s}_i, \overline{a}_i, \overline{s}^\star_i) - \varphi^\xi_\omega(\overline{z}_i, \overline{a}'_i, \overline{z}'_i) + \varphi^{\mathbf{P}}_\omega(s_i, a_i, \overline{s}_i, \overline{a}_i, \overline{s}'_i) - \varphi^{\mathbf{P}}_\omega(s_i, a_i, \overline{s}_i, \overline{a}_i, \overline{s}^\star_i)$
    $\mathcal{P} \leftarrow \sum_{i=1}^{N} \mathrm{GP}\big( \varphi^\xi_\omega, \langle \overline{s}_i, \overline{a}_i, \overline{s}^\star_i \rangle, \langle \overline{z}_i, \overline{a}'_i, \overline{z}'_i \rangle \big) + \mathrm{GP}\big( x \mapsto \varphi^{\mathbf{P}}_\omega(s_i, a_i, \overline{s}_i, \overline{a}_i, x),\ \overline{s}'_i,\ \overline{s}^\star_i \big)$
    Update the Lipschitz network parameters $\omega$ by ascending $\frac{1}{N}(\beta \mathcal{W} - \delta \mathcal{P})$
    if $t \bmod m = 0$ then
        $\mathcal{L} \leftarrow \sum_{i=1}^{N} d_{\mathcal{S}}(s_i, G_\theta(\overline{s}_i)) + d_{\mathcal{A}}(a_i, \psi_\theta(\overline{s}_i, \overline{a}_i)) + d_{\mathcal{R}}(r_i, \overline{\mathcal{R}}_\theta(\overline{s}_i, \overline{a}_i)) + d_{\mathcal{S}}(s'_i, G_\theta(\overline{s}'_i))$
        Update the latent space model parameters $\langle \iota, \theta \rangle$ by descending $\frac{1}{N}(\mathcal{L} + \beta \mathcal{W})$
function $\mathrm{GP}(\varphi_\omega, x, y)$    ▷ gradient penalty for $\varphi_\omega\colon \mathbb{R}^n \to \mathbb{R}$ and $x, y \in \mathbb{R}^n$
    $\epsilon \sim U(0, 1)$; $\hat{x} \leftarrow \epsilon x + (1 - \epsilon) y$    ▷ random point on the straight line between $x$ and $y$
    return $(\| \nabla_{\hat{x}}\, \varphi_\omega(\hat{x}) \| - 1)^2$
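The gradient-penalty step of Algorithm 1 can be illustrated with a toy linear critic, for which the input-gradient is known in closed form. This is only a sketch of the mechanism (the actual $\varphi^\xi_\omega$, $\varphi^{\mathbf{P}}_\omega$ are neural networks whose input-gradients are obtained by automatic differentiation):

```python
import math
import random

# Toy critic phi_w(x) = <w, x>: its gradient w.r.t. the input is w everywhere,
# so the penalty at any interpolate is (||w|| - 1)^2.
def critic(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def gradient_penalty(w, x, y, rng):
    """GP(phi_w, x, y): penalize deviation of the input-gradient norm from 1
    at a random point on the straight line between x and y."""
    eps = rng.random()
    x_hat = [eps * a + (1 - eps) * b for a, b in zip(x, y)]  # interpolate
    grad = w  # analytic input-gradient of the linear critic at x_hat
    norm = math.sqrt(sum(g * g for g in grad))
    return (norm - 1.0) ** 2

rng = random.Random(0)
w = [3.0, 4.0]            # ||w|| = 5: this critic is far from 1-Lipschitz
assert critic(w, [1.0, 1.0]) == 7.0
gp = gradient_penalty(w, [0.0, 0.0], [1.0, 1.0], rng)
assert abs(gp - 16.0) < 1e-9  # (5 - 1)^2, regardless of the interpolate
```

Ascending $\beta\mathcal{W} - \delta\mathcal{P}$ therefore pushes the critic towards the dual supremum while keeping its gradient norms near 1, i.e., near the 1-Lipschitz feasible set of the Kantorovich dual.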

3.2. DISCRETE LATENT SPACES

To enable the verification of latent models supported by the bisimulation guarantees of Eq. 1, we focus on the special case of discrete latent space models. Our approach relies on a continuous relaxation of discrete random variables, regulated by some temperature parameter(s) $\lambda$: discrete random variables are retrieved as $\lambda \to 0$, which amounts to applying a rounding operator. For training, we use the temperature-controlled relaxations to differentiate the objective and let the gradient flow through the networks. When we deploy the latent policy in the environment and formally check the latent model, the zero-temperature limit is used. An overview of the approach is depicted in Fig. 2.

[Figure 2: overview of the architecture. Training (Gumbel-softmax and logistic sampling at temperature $\lambda$) vs. evaluation ($\lambda \to 0$, one-hot/binary encodings): the encoders $\phi_\iota$, $\phi^{\mathcal{A}}_\iota$, decoders $G_\theta$, $\psi_\theta$, $\overline{\mathcal{R}}_\theta$, $\overline{\mathbf{P}}_\theta$, MAFs modeling the latent dynamics, and the Lipschitz networks $\varphi^\xi_\omega$, $\varphi^{\mathbf{P}}_\omega$ producing $W_{\xi_\pi}$ and $L^{\xi_\pi}_{\mathbf{P}}$; label bits satisfy $\ell(s) = \overline{\ell}(\overline{s})$.]

State encoder. We work with a binary representation of the latent states. First, this induces compact networks, able to deal with a large discrete space via a tractable number of parameter variables. Most importantly, this ensures that Assumption 2.1 is satisfied: let $n = \log_2 |\overline{\mathcal{S}}|$; we reserve $|\mathrm{AP}|$ bits in $\overline{\mathcal{S}}$ and, each time $s \in \mathcal{S}$ is passed to $\phi_\iota$, $n - |\mathrm{AP}|$ bits are produced and concatenated with $\ell(s)$, ensuring a perfect reconstruction of the labels and thus the bisimulation bounds. To produce Bernoulli variables, $\phi_\iota$ deterministically maps $s$ to a latent code $z$, which is passed to the Heaviside function $H(z) = \mathbf{1}_{z > 0}$. We train $\phi_\iota$ by using the smooth approximation $H_\lambda(z) = \sigma(2z / \lambda)$, which satisfies $H = \lim_{\lambda \to 0} H_\lambda$. Latent distributions.
Besides the discontinuity of their latent image space, a major challenge of optimizing over discrete distributions is sampling, which is required to be a differentiable operation. We circumvent this by using concrete distributions (Jang et al., 2017; Maddison et al., 2017): the idea is to sample reparameterizable random variables from $\lambda$-parameterized distributions and apply a differentiable, nonlinear operator downstream. We use the Gumbel-softmax trick to sample from distributions over (one-hot encoded) latent actions ($\phi^{\mathcal{A}}_\iota$, $\overline{\pi}_\theta$). For binary distributions ($\overline{\mathbf{P}}_\theta$, $\overline{\xi}^{\,\overline{\pi}}_\theta$), each relaxed Bernoulli with logit $\alpha$ is retrieved by drawing a logistic random variable with location $\alpha / \lambda$ and scale $1 / \lambda$, then applying a sigmoid downstream. We emphasize that this trick alone (as used by Corneil et al. 2018; Delgrange et al. 2022) is not sufficient: it yields independent Bernoullis, which is too restrictive in general and prevents learning sound transition dynamics (cf. Example 1).

Example 1. Let $\mathcal{M}$ be the discrete MC of Fig. 3. In one-hot encoding, $\mathrm{AP} = \{ \text{goal}\colon \langle 1, 0 \rangle, \text{unsafe}\colon \langle 0, 1 \rangle \}$. We assume that 3 bits are used for the (binary) state space, with $\overline{\mathcal{S}} = \{ \overline{s}_0\colon \langle 0, 0, 0 \rangle,\ \overline{s}_1\colon \langle 1, 0, 0 \rangle,\ \overline{s}_2\colon \langle 0, 1, 0 \rangle,\ \overline{s}_3\colon \langle 0, 1, 1 \rangle \}$ (the first two bits are reserved for the labels). Considering each bit as independent is not sufficient to learn $\overline{\mathbf{P}}$: the optimal estimation $\overline{\mathbf{P}}_{\theta^\star}(\cdot \mid \overline{s}_0)$ is in that case represented by the independent Bernoulli vector $b = \langle 1/2, 1/2, 1/4 \rangle$, giving the probability of setting each bit independently. This yields a poor estimation of the actual transition function: $\overline{\mathbf{P}}_{\theta^\star}(\overline{s}_0 \mid \overline{s}_0) = (1 - b_1)(1 - b_2)(1 - b_3) = \overline{\mathbf{P}}_{\theta^\star}(\overline{s}_1 \mid \overline{s}_0) = b_1 (1 - b_2)(1 - b_3) = \overline{\mathbf{P}}_{\theta^\star}(\overline{s}_2 \mid \overline{s}_0) = (1 - b_1)\, b_2 (1 - b_3) = 3/16$, and $\overline{\mathbf{P}}_{\theta^\star}(\overline{s}_3 \mid \overline{s}_0) = (1 - b_1)\, b_2\, b_3 = 1/16$.
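The numbers of Example 1 can be checked mechanically; note also that the independent-bit model leaks probability mass onto bit patterns that encode no state at all:

```python
# Example 1: independent Bernoulli bits b = <1/2, 1/2, 1/4> over a 3-bit code.
b = (0.5, 0.5, 0.25)

def prob(code):
    """Probability of a 3-bit code under the independent-Bernoulli model."""
    p = 1.0
    for bit, bi in zip(code, b):
        p *= bi if bit else (1.0 - bi)
    return p

states = {"s0": (0, 0, 0), "s1": (1, 0, 0), "s2": (0, 1, 0), "s3": (0, 1, 1)}

# The four encoded states get the probabilities derived in Example 1:
assert prob(states["s0"]) == prob(states["s1"]) == prob(states["s2"]) == 3 / 16
assert prob(states["s3"]) == 1 / 16

# 6/16 of the mass lands on invalid codes (e.g. <1, 1, 0>), i.e., the model
# cannot even represent a distribution supported on the state space:
assert sum(map(prob, states.values())) == 10 / 16
```

This mass leakage is precisely what the autoregressive factorization of the next paragraph repairs, by letting later bits condition on earlier ones.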
We consider instead relaxed multivariate Bernoulli distributions, obtained by decomposing $\overline{P} \in \Delta(\overline{\mathcal{S}})$ as a product of conditionals: $\overline{P}(\overline{s}) = \prod_{i=1}^{n} \overline{P}(\overline{s}_i \mid \overline{s}_{1:i-1})$, where $\overline{s}_i$ is the $i$-th entry (bit) of $\overline{s}$. We learn such distributions by introducing a masked autoregressive flow (MAF, Papamakarios et al. 2017) for relaxed Bernoullis via the recursion $\overline{s}_i = \sigma\big( (l_i + \alpha_i) / \lambda \big)$, where $l_i \sim \mathrm{Logistic}(0, 1)$, $\alpha_i = f_i(\overline{s}_{1:i-1})$, and $f$ is a MADE (Germain et al., 2015), a feedforward network implementing the conditional dependency of the outputs on the inputs via a mask that keeps only the connections necessary to enforce the autoregressive property. We use this MAF to model $\overline{\mathbf{P}}_\theta$ and the dynamics related to the labels in $\overline{\xi}^{\,\overline{\pi}}_\theta$. We fix the logits of the remaining $n - |\mathrm{AP}|$ bits to 0 to allow for a fairly distributed latent space.
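The autoregressive factorization above can be sketched with hand-written conditional logit functions standing in for the MADE (which would produce all logits in a single masked forward pass). All conditionals here are made up; the point is only that a bit can depend on the bits sampled before it, which the independent model of Example 1 cannot express:

```python
import math
import random

def sample_autoregressive(logit_fns, rng):
    """Sample bits left to right, each logit conditioned on the prefix,
    mimicking P(s) = prod_i P(s_i | s_{1:i-1})."""
    bits = []
    for f in logit_fns:
        alpha = f(bits)                      # logit given sampled prefix
        p = 1.0 / (1.0 + math.exp(-alpha))   # Bernoulli probability
        bits.append(1 if rng.random() < p else 0)
    return tuple(bits)

# Hypothetical conditionals: the second bit copies the first, a perfectly
# correlated pair that no independent-Bernoulli vector can represent.
logit_fns = [
    lambda prefix: 0.0,                           # P(bit1 = 1) = 1/2
    lambda prefix: 50.0 if prefix[0] else -50.0,  # bit2 ~= bit1
]
rng = random.Random(0)
samples = [sample_autoregressive(logit_fns, rng) for _ in range(200)]
assert all(b1 == b2 for b1, b2 in samples)
```

During training, the hard Bernoulli draw is replaced by the relaxed logistic-plus-sigmoid reparameterization described in the text, so gradients can flow through the sampling step.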

4. EXPERIMENTS

We evaluate the quality of the latent space models learned and the policies distilled through W$^2$AE-MDPs. To do so, we first trained deep-RL policies (DQN, Mnih et al. 2015, on discrete and SAC, Haarnoja et al. 2018, on continuous action spaces) for various OpenAI Gym benchmarks (Brockman et al., 2016), which we then distill via our approach (Fig. 4). We evaluate (a) the W$^2$AE-MDP training metrics, (b) the abstraction and representation quality via PAC upper bounds on the local losses (Delgrange et al., 2022), and (c) the performance of the distilled policy when deployed in the original environment. The confidence metrics and performance are compared with those of VAE-MDPs. Finally, we formally verify properties in the latent model. The exact setting to reproduce our results is given in Appendix B. Learning metrics. The objective (Fig. 4a) is a weighted sum of the reconstruction loss and the two Wasserstein regularizers. The choice of $\beta$ defines the optimization direction. In contrast to VAEs (cf. Appendix C), WAEs indeed naturally avoid posterior collapse (Tolstikhin et al., 2018), indicating that the latent space is consistently distributed. Optimizing the objective (Fig. 4a) effectively allows minimizing the local losses (Fig. 4b) and recovering the performance of the original policy (Fig. 4c). Local losses. For both V- and WAEs, we formally evaluate PAC upper bounds on $L^{\xi_{\overline{\pi}_\theta}}_{\mathcal{R}}$ and $L^{\xi_{\overline{\pi}_\theta}}_{\mathbf{P}}$ via the algorithm of Delgrange et al. (2022) (Fig. 4b). The lower the local losses, the closer $\mathcal{M}$ and $\overline{\mathcal{M}}_\theta$ are in terms of the behaviors induced by $\overline{\pi}_\theta$ (cf. Eq. 1). For VAEs, the losses are evaluated on a transition function obtained via frequency estimation of the latent transition dynamics (Delgrange et al., 2022), i.e., by reconstructing the transition model a posteriori and collecting data to estimate the transition probabilities (e.g., Bazille et al. 2020; Corneil et al. 2018). We thus also report the metrics for this frequency estimate.
Our bounds quickly converge to close values in general for $P_\theta$ and $\overline P$, whereas for VAEs, the convergence is slow and unstable, with $\overline P$ offering better bounds. We emphasize that WAEs do not require this additional reconstruction step to obtain losses that can be leveraged to assess the quality of the model, in contrast to VAEs, where learning $P_\theta$ was performed via overly restrictive distributions, leading to poor estimates in general (cf. Ex. 1). Finally, when the distilled policies offer comparable performance (Fig. 4c), our bounds are either close to or better than those of VAEs.

Distillation. The bisimulation guarantees (Eq. 1) are only valid for $\pi_\theta$, the policy under which formal properties can be verified. It is thus crucial that $\pi_\theta$ achieves performance close to that of π, the original policy, when deployed in the RL environment. We evaluate the performance of $\pi_\theta$ via the undiscounted episode return $R_{\pi_\theta}$ obtained by running $\pi_\theta$ in the original model $\mathcal M$. We observe that $R_{\pi_\theta}$ approaches the original performance $R_\pi$ faster for W- than for VAEs: WAEs converge in a few steps for all environments, whereas the full learning budget is sometimes necessary with VAEs. The success in recovering the original performance emphasizes the representation quality guarantees (Eq. 1) induced by WAEs: when the local losses are minimized, all original states embedded to the same representation are bisimilarly close. Distilling the policy over the new representation, albeit discrete and hence coarser, still achieves effective performance since $\phi_\iota$ keeps only what is needed to preserve behaviors, and thus values. Furthermore, the distillation can remove some non-robustness acquired during RL: $\pi_\theta$ prescribes the same actions for bisimilarly close states, whereas this is not necessarily the case for π.

Formal verification.
To formally verify $\mathcal M_\theta$, we implemented a value iteration (VI) engine, one of the most popular algorithms for computing property probabilities in MDPs (e.g., Baier & Katoen 2008; Hensel et al. 2021; Kwiatkowska et al. 2022), which handles the neural network encoding of the latent space for discounted properties. We verify time-to-failure properties φ, often used to check the failure rate of a system (Pnueli, 1977), by measuring whether the agent fails before the end of the episode. Although simple, such properties highlight the applicability of our approach to reachability events, which are the building blocks for verifying MDPs (Baier & Katoen 2008; cf. Appendix B.7). In particular, we checked whether the agent reaches an unsafe position or angle (CartPole, LunarLander), does not reach its goal position (MountainCar, Acrobot), and does not reach and stay in a safe region of the system (Pendulum). Results are in Table 1: for each environment, we select the distilled policy offering the best trade-off between performance (episode return) and abstraction quality (local losses). As an extra confidence metric, we report the value difference $\|V^{\pi_\theta}\| = |V^{\pi_\theta}(s_I) - \overline V^{\pi_\theta}(\bar s_I)|$ obtained by executing $\pi_\theta$ in $\mathcal M$ and $\mathcal M_\theta$ ($V^{\pi_\theta}(\cdot)$ is estimated by averaging rollouts, while $\overline V^{\pi_\theta}(\cdot)$ is formally computed).
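Our VI engine operates on the neural-network encoding of the latent space; as an illustration of the underlying computation, here is a sketch of discounted-reachability value iteration on an explicit toy transition tensor. The tensor layout and function names are our assumptions, not the engine's interface.

```python
import numpy as np

def reach_value_iteration(P, bad, gamma=0.99, policy=None, tol=1e-8):
    """Discounted probability of eventually reaching the set `bad`.

    P: (n_states, n_actions, n_states) transition tensor,
    bad: boolean array marking failure states (treated as absorbing targets),
    policy: (n_states, n_actions) stochastic policy; None = worst case (max).
    """
    v = bad.astype(float)
    while True:
        # one-step lookahead; target states keep value 1
        q = gamma * np.einsum("sat,t->sa", P, v)
        if policy is None:
            new_v = np.where(bad, 1.0, q.max(axis=1))
        else:
            new_v = np.where(bad, 1.0, (policy * q).sum(axis=1))
        if np.abs(new_v - v).max() < tol:
            return new_v
        v = new_v
```

On a two-state chain where the single action leads surely to the failure state, the computed value from the safe state is exactly γ, the one-step discounted failure probability.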

5. CONCLUSION

We presented WAE-MDPs, a framework for learning formally verifiable distillations of RL policies with bisimulation guarantees. The latter, along with the learned abstraction of the unknown continuous environment into a discrete model, enables the verification. Our method overcomes the limitations of VAE-MDPs, and our results show that it outperforms the latter in terms of learning speed, model quality, and performance, in addition to being supported by stronger learning guarantees. As mentioned by Delgrange et al. (2022), distillation failure reveals a lack of robustness of the original RL policies. In particular, we found that distilling highly noise-sensitive RL policies (such as robotics simulations, e.g., Todorov et al. 2012) is laborious, even though the result remains formally verifiable. We demonstrated the feasibility of our approach through the verification of reachability objectives, which are the building blocks of stochastic model checking (Baier & Katoen, 2008). Beyond the scope of this work, the verification of general discounted ω-regular properties is theoretically enabled in our model via reachability to components of standard constructions based on automata products (e.g., Baier et al. 2016; Sickert et al. 2016) and discounted-games algorithms (Chatterjee et al., 2010). Beyond distillation, our results, supported by Thm. 3.3, suggest that the WAE-MDP can be used as a general latent space learner for RL, further opening possibilities to combine RL and formal methods online when no formal model is known a priori, and thereby to address safety in RL with guarantees.

REPRODUCIBILITY STATEMENT

The main text references the Appendix sections presenting the proofs or additional details of every claim, Assumption, Lemma, and Theorem occurring in the paper. In addition, Appendix B is dedicated to the presentation of the setup, hyperparameters, and other details required for reproducing the results of Section 4. We provide the source code of the implementation of our approach in the supplementary material, and we also provide the models saved during training that we used for model checking (i.e., for reproducing the results of Table 1). Additionally, we present in a notebook (evaluation.html) videos demonstrating how our distilled policies behave in each environment, and code snippets showing how we formally verified the policies.

APPENDIX A THEORETICAL DETAILS ON WAE-MDPS

A.1 THE DISCREPANCY MEASURE

We show that reasoning about discrepancy measures between stationary distributions is sound in the context of infinite-interaction and episodic RL processes. Let $P_\theta$ be a parameterized behavioral model that generates finite traces from the original environment (i.e., finite sequences of states, actions, and rewards of the form $\langle s_{0:T}, a_{0:T-1}, r_{0:T-1}\rangle$). Our goal is to find the parameter θ offering the most accurate reconstruction of the original traces issued from the original model $\mathcal M$ operating under π. We demonstrate that, in the limit, considering the OT between trace-based distributions is equivalent to considering the OT between the stationary distribution of $\mathcal M_\pi$ and that of the behavioral model. Let us first formally recall the definition of the metric on the transitions of the MDP.

Raw transition distance. Assume that $\mathcal S$, $\mathcal A$, and $\mathrm{Im}(\mathcal R)$ are respectively equipped with metrics $d_S$, $d_A$, and $d_R$. We define the raw transition distance $\vec d$ over transitions of $\mathcal M$, i.e., tuples of the form $\langle s, a, r, s'\rangle$, as
$$\vec d\left(\langle s_1, a_1, r_1, s'_1\rangle, \langle s_2, a_2, r_2, s'_2\rangle\right) = d_S(s_1, s_2) + d_A(a_1, a_2) + d_R(r_1, r_2) + d_S(s'_1, s'_2).$$
In a nutshell, $\vec d$ is the sum of the distances of all the transition components. It is a well-defined distance metric since the sum of metrics preserves the identity of indiscernibles, symmetry, and the triangle inequality.

Trace-based distributions. The raw distance $\vec d$ allows reasoning about transitions; we thus consider the distribution over transitions occurring along traces of length $T$ to compare the dynamics of the original and behavioral models:
$$\mathcal D_\pi[T]\left(s, a, r, s'\right) = \frac{1}{T} \sum_{t=1}^{T} \xi^t_\pi(s \mid s_I)\cdot \pi(a\mid s)\cdot P(s'\mid s,a)\cdot \mathbf 1_{r = \mathcal R(s,a)}, \text{ and}$$
$$P_\theta[T]\left(s, a, r, s'\right) = \frac{1}{T} \sum_{t=1}^{T} \mathbb E_{s_{0:t}, a_{0:t-1}, r_{0:t-1} \sim P_\theta[t]}\, \mathbf 1_{\langle s_{t-1}, a_{t-1}, r_{t-1}, s_t\rangle = \langle s, a, r, s'\rangle},$$
where $P_\theta[T]$ denotes the distribution over traces of length $T$ generated from $P_\theta$.
Intuitively, $\frac{1}{T} \sum_{t=1}^{T} \xi^t_\pi(s \mid s_I)$ can be seen as the fraction of time spent in $s$ along traces of length $T$ starting from the initial state (Kulkarni, 1995). Drawing $\langle s, a, r, s'\rangle \sim \mathcal D_\pi[T]$ thus trivially follows: it is equivalent to drawing $s$ from $\frac{1}{T} \sum_{t=1}^{T} \xi^t_\pi(\cdot \mid s_I)$, then respectively $a$ and $s'$ from $\pi(\cdot \mid s)$ and $P(\cdot \mid s, a)$, to finally obtain $r = \mathcal R(s, a)$. Given $T \in \mathbb N$, our objective is to minimize the Wasserstein distance between those distributions: $W_{\vec d}(\mathcal D_\pi[T], P_\theta[T])$. The following Lemma enables optimizing the Wasserstein distance between the original MDP and the behavioral model when traces are drawn from episodic RL processes or infinite interactions (Huang, 2020).

Lemma A.1. Assume the existence of a stationary behavioral model $\xi_\theta = \lim_{T \to \infty} P_\theta[T]$; then
$$\lim_{T \to \infty} W_{\vec d}\left(\mathcal D_\pi[T], P_\theta[T]\right) = W_{\vec d}\left(\xi_\pi, \xi_\theta\right).$$

Proof. First, note that $\frac{1}{T} \sum_{t=1}^{T} \xi^t_\pi(\cdot \mid s_I)$ weakly converges to $\xi_\pi$ as $T \to \infty$ (Kulkarni, 1995). The result then follows from (Villani, 2009, Corollary 6.9).
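The weak-convergence argument behind Lemma A.1 is easy to probe numerically for a finite-state chain: the Cesàro average $\frac{1}{T}\sum_{t=1}^{T} \xi^t_\pi$ approaches the stationary distribution as $T$ grows. The sketch below uses a toy 3-state chain; all numbers and helper names are illustrative.

```python
import numpy as np

def cesaro_occupancy(P, s0, T):
    """Average of the t-step state distributions, (1/T) * sum_{t=1..T} xi_t."""
    d = np.zeros(P.shape[0])
    d[s0] = 1.0
    avg = np.zeros_like(d)
    for _ in range(T):
        d = d @ P          # xi_t = xi_{t-1} P
        avg += d
    return avg / T

def stationary(P):
    """Stationary distribution as the leading left eigenvector of P."""
    w, v = np.linalg.eig(P.T)
    x = np.real(v[:, np.argmax(np.real(w))])
    return x / x.sum()
```

For an ergodic chain, the gap between the Cesàro average and the stationary distribution shrinks at rate $O(1/T)$, which is what makes trace-based and stationary-distribution objectives interchangeable in the limit.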

A.2 DEALING WITH DISCRETE ACTIONS

When the policy π executed in $\mathcal M$ already produces discrete actions, learning a latent action space is, in many cases, unnecessary. We thus make the following assumptions:

Assumption A.2. Let $\pi : \mathcal S \to \Delta(\mathcal A^\star)$ be the policy executed in $\mathcal M$ and assume that $\mathcal A^\star$ is a (tractable) finite set. Then, we take $\bar{\mathcal A} = \mathcal A^\star$ and $\phi^{\mathcal A}_\iota$ as the identity function, i.e., $\phi^{\mathcal A}_\iota : \mathcal S \times \mathcal A^\star \to \mathcal A^\star,\ \langle s, a^\star\rangle \mapsto a^\star$.

Assumption A.3. Assume that the action space of the original environment $\mathcal M$ is a (tractable) finite set. Then, we take $\psi_\theta$ as the identity function, i.e., $\psi_\theta = \phi^{\mathcal A}_\iota$.

Concretely, the premise of Assumption A.2 typically occurs when π is a latent policy (see Rem. 1) or when $\mathcal M$ already has a discrete action space. In the latter case, Assumptions A.2 and A.3 amount to setting $\bar{\mathcal A} = \mathcal A$ and ignoring the action encoder and embedding function. Note that if a discrete action space is too large, or if the user explicitly aims for a coarser space, then the former is not considered tractable, these assumptions do not hold, and the action space is abstracted to a smaller set of discrete actions.

A.3 PROOF OF LEMMA 3.2

Notation. From now on, we write $\phi_\iota(\bar s, \bar a \mid s, a) = \mathbf 1_{\phi_\iota(s) = \bar s} \cdot \phi^{\mathcal A}_\iota(\bar a \mid s, a)$.

Lemma 3.2. Define $\mathcal T(\bar s, \bar a, \bar s') = \mathbb E_{s, a \sim \xi_\pi}\left[\mathbf 1_{\phi_\iota(s) = \bar s} \cdot \phi^{\mathcal A}_\iota(\bar a \mid s, a) \cdot \bar P_\theta(\bar s' \mid \bar s, \bar a)\right]$ as the distribution of drawing state-action pairs from interacting with $\mathcal M$, embedding them to the latent spaces, and finally letting them transition to their successor state in $\mathcal M_\theta$. Then,
$$W_{\vec d}\left(Q_\iota, \xi_{\pi_\theta}\right) \le W_{\vec d}\left(\xi_{\pi_\theta}, \mathcal T\right) + L^{\xi_\pi}_P.$$

Proof. Wasserstein complies with the triangle inequality (Villani, 2009), which gives us
$$W_{\vec d}\left(Q_\iota, \xi_{\pi_\theta}\right) \le W_{\vec d}\left(Q_\iota, \mathcal T\right) + W_{\vec d}\left(\mathcal T, \xi_{\pi_\theta}\right)$$
(note that $W_{\vec d}$ is symmetric (Villani, 2009)), where
$$W_{\vec d}\left(\mathcal T, \xi_{\pi_\theta}\right) = \sup_{f \in \mathcal F_{\vec d}}\ \mathbb E_{s, a \sim \xi_\pi}\, \mathbb E_{\bar s, \bar a \sim \phi_\iota(\cdot \mid s, a)}\, \mathbb E_{\bar s' \sim \bar P_\theta(\cdot \mid \bar s, \bar a)} f(\bar s, \bar a, \bar s') - \mathbb E_{\bar s \sim \xi_{\pi_\theta}}\, \mathbb E_{\bar a \sim \pi_\theta(\cdot \mid \bar s)}\, \mathbb E_{\bar s' \sim \bar P_\theta(\cdot \mid \bar s, \bar a)} f(\bar s, \bar a, \bar s'),$$
and
$$W_{\vec d}\left(Q_\iota, \mathcal T\right) = \sup_{f \in \mathcal F_{\vec d}}\ \mathbb E_{s, a, s' \sim \xi_\pi}\, \mathbb E_{\bar s, \bar a, \bar s' \sim \phi_\iota(\cdot \mid s, a, s')} f(\bar s, \bar a, \bar s') - \mathbb E_{s, a \sim \xi_\pi}\, \mathbb E_{\bar s, \bar a \sim \phi_\iota(\cdot \mid s, a)}\, \mathbb E_{\bar s' \sim \bar P_\theta(\cdot \mid \bar s, \bar a)} f(\bar s, \bar a, \bar s') \quad (3)$$
$$\le \mathbb E_{s, a \sim \xi_\pi}\, \mathbb E_{\bar s, \bar a \sim \phi_\iota(\cdot \mid s, a)} \sup_{f \in \mathcal F_{\vec d}}\left[\mathbb E_{s' \sim P(\cdot \mid s, a)} f(\bar s, \bar a, \phi_\iota(s')) - \mathbb E_{\bar s' \sim \bar P_\theta(\cdot \mid \bar s, \bar a)} f(\bar s, \bar a, \bar s')\right] \quad (4)$$
$$= \mathbb E_{s, a \sim \xi_\pi}\, \mathbb E_{\bar s, \bar a \sim \phi_\iota(\cdot \mid s, a)}\, W_{d_{\bar S}}\left(\phi_\iota P(\cdot \mid s, a), \bar P_\theta(\cdot \mid \bar s, \bar a)\right) = L^{\xi_\pi}_P. \quad (5)$$
We pass from Eq. 3 to Eq. 4 by Jensen's inequality (the supremum of an expectation is bounded by the expectation of the supremum). To see how we pass from Eq. 4 to Eq. 5, recall that
$$\mathcal F_{\vec d} = \left\{ f : f(\bar s_1, \bar a_1, \bar s'_1) - f(\bar s_2, \bar a_2, \bar s'_2) \le d_{\bar S}(\bar s_1, \bar s_2) + d_{\bar A}(\bar a_1, \bar a_2) + d_{\bar S}(\bar s'_1, \bar s'_2) \right\},$$
and observe that $\bar s$ and $\bar a$ are fixed in the supremum computation of Eq. 4: all functions $f$ considered are of the form $f(\bar s, \bar a, \cdot)$. It is thus sufficient to consider the supremum over functions from the following subset of $\mathcal F_{\vec d}$:
$$\left\{ f : f(\bar s, \bar a, \bar s'_1) - f(\bar s, \bar a, \bar s'_2) \le d_{\bar S}(\bar s, \bar s) + d_{\bar A}(\bar a, \bar a) + d_{\bar S}(\bar s'_1, \bar s'_2) \right\} = \left\{ f : f(\bar s'_1) - f(\bar s'_2) \le d_{\bar S}(\bar s'_1, \bar s'_2) \right\} = \mathcal F_{d_{\bar S}},$$
which yields the Wasserstein terms of Eq. 5.

Given a state $s \in \mathcal S$ in the original model, the (parallel) execution of π in $\mathcal M_\theta$ is enabled through $\pi(a, \bar a \mid s) = \pi(a \mid s) \cdot \phi^{\mathcal A}_\iota(\bar a \mid \phi_\iota(s), a)$ (cf. Fig. 1b). The local transition loss resulting from this interaction is:
$$L^{\xi_\pi}_P = \mathbb E_{s, a \sim \xi_\pi}\, \mathbb E_{\bar a \sim \phi^{\mathcal A}_\iota(\cdot \mid \phi_\iota(s), a)}\, W_{d_{\bar S}}\left(\phi_\iota P(\cdot \mid s, a), \bar P_\theta(\cdot \mid \phi_\iota(s), \bar a)\right).$$

A.4 PROOF OF THEOREM 3.3

Before proving Theorem 3.3, let us introduce the following Lemma, which explicitly demonstrates the link between the transition regularizer of the W²AE-MDP objective and the local transition loss required to obtain the guarantees related to the bisimulation bounds of Eq. 1.

Lemma A.4. Assume that traces are generated by running $\pi \in \bar\Pi$ in the original environment; then
$$\mathbb E_{s, a^\star \sim \xi_\pi}\, \mathbb E_{\bar a \sim \phi^{\mathcal A}_\iota(\cdot \mid \phi_\iota(s), a^\star)}\, W_{d_{\bar S}}\left(\phi_\iota P(\cdot \mid s, a^\star), \bar P_\theta(\cdot \mid \phi_\iota(s), \bar a)\right) = L^{\xi_\pi}_P.$$

Proof. Since the latent policy π generates latent actions, Assumption A.2 holds, which means
$$\mathbb E_{s, a^\star \sim \xi_\pi}\, \mathbb E_{\bar a \sim \phi^{\mathcal A}_\iota(\cdot \mid \phi_\iota(s), a^\star)}\, W_{d_{\bar S}}\left(\phi_\iota P(\cdot \mid s, a^\star), \bar P_\theta(\cdot \mid \phi_\iota(s), \bar a)\right) = \mathbb E_{s, \bar a \sim \xi_\pi}\, W_{d_{\bar S}}\left(\phi_\iota P(\cdot \mid s, \bar a), \bar P_\theta(\cdot \mid \phi_\iota(s), \bar a)\right) = L^{\xi_\pi}_P.$$

Theorem 3.3. Assume that traces are generated by running a latent policy $\pi \in \bar\Pi$ in the original environment and let $d_R$ be the usual Euclidean distance; then the W²AE-MDP objective is
$$\min_{\iota, \theta}\ \mathbb E_{s, s' \sim \xi_\pi}\left[d_S\left(s, G_\theta(\phi_\iota(s))\right) + d_S\left(s', G_\theta(\phi_\iota(s'))\right)\right] + L^{\xi_\pi}_R + \beta \cdot \left(W^{\xi_\pi} + L^{\xi_\pi}_P\right).$$

Proof. We distinguish two cases: (i) the original and latent models share the same discrete action space, i.e., $\bar{\mathcal A} = \mathcal A$; and (ii) the two have different action spaces (e.g., when the original action space is continuous), i.e., $\bar{\mathcal A} \ne \mathcal A$. In both cases, the local-loss terms follow by definition of $L^{\xi_\pi}_R$ and by Lemma A.4. When $d_R$ is the Euclidean distance (or simply the $L^1$ distance, since rewards are scalar), the expected reward distance occurring in the expected trace-distance term $\vec d$ of the W²AE-MDP objective directly translates to the local loss $L^{\xi_\pi}_R$. Concerning the local transition loss, in case (i) the result follows from Assumptions A.2 and A.3. In case (ii), only Assumption A.2 holds, meaning the action-encoder term of the W²AE-MDP objective is ignored, but not the action-embedding term appearing in $G_\theta$. Given $s \sim \xi_\pi$, recall that executing π in $\mathcal M$ amounts to embedding the produced latent actions $\bar a \sim \pi(\cdot \mid \phi_\iota(s))$ back into the original environment via $a = \psi_\theta(\phi_\iota(s), \bar a)$ (cf. Rem. 1 and Fig. 1a). Therefore, the projection of $\vec d\left(\langle s, a, r, s'\rangle, G_\theta(\phi_\iota(s), \bar a, \phi_\iota(s'))\right)$ on the action space $\mathcal A$ is $d_A\left(\psi_\theta(\phi_\iota(s), \bar a), \psi_\theta(\phi_\iota(s), \bar a)\right) = 0$, for $r = \mathcal R(s, a)$ and $s' \sim P(\cdot \mid s, a)$.

A.5 OPTIMIZING THE TRANSITION REGULARIZER

In the following, we detail how we derive a tractable form of our transition regularizer $L^{\xi_\pi}_P(\omega)$. Optimizing the ground Kantorovich-Rubinstein duality requires introducing a parameterized 1-Lipschitz network $\varphi^P_\omega$ that needs to be trained to attain the supremum of the dual:
$$L^{\xi_\pi}_P(\omega) = \mathbb E_{s, a \sim \xi_\pi}\, \mathbb E_{\bar s, \bar a \sim \phi_\iota(\cdot \mid s, a)}\ \max_{\omega : \varphi^P_\omega \in \mathcal F_{d_{\bar S}}}\left[\mathbb E_{\bar s' \sim \phi_\iota P(\cdot \mid s, a)} \varphi^P_\omega(\bar s') - \mathbb E_{\bar s' \sim \bar P_\theta(\cdot \mid \bar s, \bar a)} \varphi^P_\omega(\bar s')\right].$$
Under this form, optimizing $L^{\xi_\pi}_P(\omega)$ is intractable due to the expectation over the maximum. The following Lemma allows rewriting $L^{\xi_\pi}_P$ to make the optimization tractable through Monte Carlo estimation.

Lemma A.5. Let $\mathcal X, \mathcal Y$ be two measurable sets, $\xi \in \Delta(\mathcal X)$, $P : \mathcal X \to \Delta(\mathcal Y)$, $Q : \mathcal X \to \Delta(\mathcal Y)$, and $d : \mathcal Y \times \mathcal Y \to [0, +\infty)$ be a metric on $\mathcal Y$. Then,
$$\mathbb E_{x \sim \xi}\, W_d\left(P(\cdot \mid x), Q(\cdot \mid x)\right) = \sup_{\varphi : \mathcal X \to \mathcal F_d}\ \mathbb E_{x \sim \xi}\left[\mathbb E_{y_1 \sim P(\cdot \mid x)} \varphi(x)(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} \varphi(x)(y_2)\right].$$

Proof. Our objective is to show that
$$\mathbb E_{x \sim \xi}\left[\sup_{f \in \mathcal F_d}\ \mathbb E_{y_1 \sim P(\cdot \mid x)} f(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} f(y_2)\right] \quad (6)$$
$$= \sup_{\varphi : \mathcal X \to \mathcal F_d}\ \mathbb E_{x \sim \xi}\left[\mathbb E_{y_1 \sim P(\cdot \mid x)} \varphi(x)(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} \varphi(x)(y_2)\right]. \quad (7)$$
We start with (6) ≤ (7). Construct $\varphi^\star : \mathcal X \to \mathcal F_d$ by setting, for all $x \in \mathcal X$,
$$\varphi^\star(x) = \arg\sup_{f \in \mathcal F_d}\ \mathbb E_{y_1 \sim P(\cdot \mid x)} f(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} f(y_2).$$
This gives us
$$\mathbb E_{x \sim \xi}\left[\sup_{f \in \mathcal F_d}\ \mathbb E_{y_1 \sim P(\cdot \mid x)} f(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} f(y_2)\right] = \mathbb E_{x \sim \xi}\left[\mathbb E_{y_1 \sim P(\cdot \mid x)} \varphi^\star(x)(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} \varphi^\star(x)(y_2)\right] \le \sup_{\varphi : \mathcal X \to \mathcal F_d}\ \mathbb E_{x \sim \xi}\left[\mathbb E_{y_1 \sim P(\cdot \mid x)} \varphi(x)(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} \varphi(x)(y_2)\right].$$
It remains to show (6) ≥ (7). Take
$$\varphi^\star = \arg\sup_{\varphi : \mathcal X \to \mathcal F_d}\ \mathbb E_{x \sim \xi}\left[\mathbb E_{y_1 \sim P(\cdot \mid x)} \varphi(x)(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} \varphi(x)(y_2)\right].$$
Then, for all $x \in \mathcal X$, we have $\varphi^\star(x) \in \mathcal F_d$, which means
$$\mathbb E_{y_1 \sim P(\cdot \mid x)} \varphi^\star(x)(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} \varphi^\star(x)(y_2) \le \sup_{f \in \mathcal F_d}\ \mathbb E_{y_1 \sim P(\cdot \mid x)} f(y_1) - \mathbb E_{y_2 \sim Q(\cdot \mid x)} f(y_2).$$
Taking the expectation over $x \sim \xi$ on both sides finally yields the result.

Corollary A.5.1. Let $\xi_\pi$ be a stationary distribution of $\mathcal M_\pi$ and $\mathcal X = \mathcal S \times \mathcal A \times \bar{\mathcal S} \times \bar{\mathcal A}$; then
$$L^{\xi_\pi}_P = \sup_{\varphi : \mathcal X \to \mathcal F_{d_{\bar S}}}\ \mathbb E_{s, a, s' \sim \xi_\pi}\, \mathbb E_{\bar s, \bar a \sim \phi_\iota(\cdot \mid s, a)}\left[\varphi(s, a, \bar s, \bar a)\left(\phi_\iota(s')\right) - \mathbb E_{\bar s' \sim \bar P_\theta(\cdot \mid \bar s, \bar a)} \varphi(s, a, \bar s, \bar a)(\bar s')\right].$$
Consequently, we rewrite $L^{\xi_\pi}_P(\omega)$ as a tractable maximization:
$$L^{\xi_\pi}_P(\omega) = \max_{\omega : \varphi^P_\omega \in \mathcal F_{d_{\bar S}}}\ \mathbb E_{s, a, s' \sim \xi_\pi}\, \mathbb E_{\bar s, \bar a \sim \phi_\iota(\cdot \mid s, a)}\left[\varphi^P_\omega\left(s, a, \bar s, \bar a, \phi_\iota(s')\right) - \mathbb E_{\bar s' \sim \bar P_\theta(\cdot \mid \bar s, \bar a)} \varphi^P_\omega\left(s, a, \bar s, \bar a, \bar s'\right)\right].$$
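For intuition on the left-hand side of Lemma A.5, the inner Wasserstein term admits a closed form when $\mathcal Y$ is the real line ($W_1 = \int |F_P - F_Q|$), so the expectation over $x$ can be estimated by plain averaging. This sketch replaces the Lipschitz critic network with that closed form; the finite supports and helper names are our illustrative assumptions.

```python
import numpy as np

def w1_line(points, p, q):
    """W1 between two distributions on the sorted real support `points`,
    via the CDF formula W1 = integral |F_p - F_q|."""
    points = np.asarray(points, dtype=float)
    Fp, Fq = np.cumsum(p), np.cumsum(q)
    gaps = np.diff(points)
    return float(np.sum(np.abs(Fp[:-1] - Fq[:-1]) * gaps))

def expected_w1(xs, P_cond, Q_cond, points, weights):
    """Estimate E_{x ~ xi} W1(P(.|x), Q(.|x)) by weighted averaging over x,
    i.e., the quantity Lemma A.5 rewrites as a single tractable supremum."""
    return float(sum(w * w1_line(points, P_cond[x], Q_cond[x])
                     for x, w in zip(xs, weights)))
```

For instance, two distributions on $\{0, 1\}$ with masses $(0.7, 0.3)$ and $(0.4, 0.6)$ are at $W_1$ distance $0.3$: exactly the mass that must travel from one support point to the other.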

A.6 THE LATENT METRIC

In the following, we show that considering the Euclidean distance for $\vec d$ and $d_{\bar S}$ in the latent space when optimizing the regularizers $W^{\xi_\pi}$ and $L^{\xi_\pi}_P$ is Lipschitz-equivalent to considering a continuous λ-relaxation of the discrete metric $\mathbf 1_{\ne}(x, y) = \mathbf 1_{x \ne y}$. Consequently, it is sufficient to enforce 1-Lipschitzness via the gradient-penalty approach of Gulrajani et al. (2017) during training to maintain the guarantees linked to the regularizers in the zero-temperature limit, when the spaces are discrete.

Lemma A.6. Let $d$ be the usual Euclidean distance and $d_\lambda : [0, 1]^n \times [0, 1]^n \to [0, 1)$, $\langle x, y\rangle \mapsto \frac{d(x, y)}{\lambda + d(x, y)}$, for $\lambda \in (0, 1]$ and $n \in \mathbb N$; then $d_\lambda$ is a distance metric.

Proof. The function $d_\lambda$ is a metric iff it satisfies the following axioms:
1. Identity of indiscernibles: if $x = y$, then $d_\lambda(x, y) = \frac{d(x, y)}{\lambda + d(x, y)} = \frac{0}{\lambda + 0} = 0$ since $d$ is a distance metric. Assume now that $d_\lambda(x, y) = 0$ and take $\alpha = d(x, y)$, for any $x, y$. Then $\alpha \in [0, +\infty)$ and $0 = \frac{\alpha}{\lambda + \alpha}$ is only achieved at $\alpha = 0$, which only occurs when $x = y$ since $d$ is a distance metric.
2. Symmetry: $d_\lambda(x, y) = \frac{d(x, y)}{\lambda + d(x, y)} = \frac{d(y, x)}{\lambda + d(y, x)} = d_\lambda(y, x)$, by symmetry of $d$.
3. Triangle inequality: multiplying $d_\lambda(x, y) + d_\lambda(y, z) \ge d_\lambda(x, z)$ through by $(\lambda + d(x, y))(\lambda + d(y, z))(\lambda + d(x, z))$ and expanding, the inequality is equivalent to
$$\lambda^2 \left(d(x, y) + d(y, z) - d(x, z)\right) + 2\lambda\, d(x, y)\, d(y, z) + d(x, y)\, d(y, z)\, d(x, z) \ge 0,$$
which holds since $d$ satisfies the triangle inequality and $\mathrm{Im}(d) \subseteq [0, \infty)$.

Furthermore, $d$ and $d_\lambda$ are Lipschitz-equivalent: taking $a = \frac{1}{\lambda + \sqrt n}$ and $b = \frac{1}{\lambda}$ in the equivalence condition $a \cdot d(x, y) \le d_\lambda(x, y) \le b \cdot d(x, y)$ yields the result, since $0 \le d(x, y) \le \sqrt n$ on $[0, 1]^n$.

Corollary A.7.1. For all $\beta \ge 1/\lambda$, $s \in \mathcal S$, $a \in \mathcal A$, $\bar s \in \bar{\mathcal S}$, and $\bar a \in \bar{\mathcal A}$, we have
1. $W_{d_\lambda}\left(\mathcal T, \xi_{\pi_\theta}\right) \le \beta \cdot W_d\left(\mathcal T, \xi_{\pi_\theta}\right)$;
2. $W_{d_\lambda}\left(\phi_\iota P(\cdot \mid s, a), \bar P_\theta(\cdot \mid \bar s, \bar a)\right) \le \beta \cdot W_d\left(\phi_\iota P(\cdot \mid s, a), \bar P_\theta(\cdot \mid \bar s, \bar a)\right)$.

Proof. By Lipschitz equivalence, taking $\beta \ge 1/\lambda$ ensures that for all $n \in \mathbb N$ and $x, y \in [0, 1]^n$, $d_\lambda(x, y) \le \beta \cdot d(x, y)$. Moreover, for any distributions $P, Q$, $W_{d_\lambda}(P, Q) \le \beta \cdot W_d(P, Q)$ (cf., e.g., Gelada et al. 2019, Lemma A.4, for details).

In practice, taking the hyperparameter $\beta \ge 1/\lambda$ in the W²AE-MDP ensures that minimizing the β-scaled regularizers w.r.t. $d$ also minimizes the regularizers w.r.t. the λ-relaxation $d_\lambda$, which is the discrete metric in the zero-temperature limit. Note that optimizing two different scale factors $\beta_1, \beta_2$ instead of a unique factor β is also good practice to interpolate between the two regularizers.
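A quick numerical check of the relaxed metric $d_\lambda = d/(\lambda + d)$ and its Lipschitz-equivalence bounds $\frac{d}{\lambda + \sqrt n} \le d_\lambda \le \frac{d}{\lambda}$ on $[0, 1]^n$ (all values illustrative):

```python
import numpy as np

def d_lambda(x, y, lam):
    """lambda-relaxation of the discrete metric: d / (lam + d),
    with d the Euclidean distance; tends to 1_{x != y} as lam -> 0."""
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(y, float))
    return d / (lam + d)
```

As λ shrinks, $d_\lambda(x, y)$ approaches 1 for any $x \ne y$, recovering the discrete metric, while the bound $d_\lambda \le d/\lambda$ explains why the scale factor $\beta \ge 1/\lambda$ suffices in Corollary A.7.1.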

B EXPERIMENT DETAILS

The code for conducting and replicating our experiments is available at https://github.com/florentdelgrange/wae_mdp.

B.1 SETUP

We used TENSORFLOW 2.7.0 (Abadi et al., 2015) to implement the neural network architecture of our W²AE-MDP, TENSORFLOW PROBABILITY 0.15.0 (Dillon et al., 2017) to handle the probabilistic components of the latent model (e.g., latent distributions with reparameterization tricks, masked autoregressive flows, etc.), as well as TF-AGENTS 0.11.0 (Guadarrama et al., 2018) to handle the RL parts of the framework. Models were trained on a cluster running CentOS Linux 7 (Core), composed of a mix of nodes containing Intel processors with the following CPU microarchitectures: (i) 10-core INTEL E5-2680v2, (ii) 14-core INTEL E5-2680v4, and (iii) 20-core INTEL Xeon Gold 6148. We used 8 cores and 32 GB of memory for each run.

B.2 STATIONARY DISTRIBUTION

To sample from the stationary distribution $\xi_\pi$ of episodic learning environments operating under $\pi \in \bar\Pi$, we implemented the recursive ϵ-perturbation trick of Huang (2020). In a nutshell, the reset of the environment is explicitly added to the state space of $\mathcal M$; this reset state is entered at the end of each episode and left with probability $1 - \epsilon$ to start a new one. We also added a special atomic proposition reset to AP to label this reset state and reason about episodic behaviors. For instance, this allows verifying whether the agent behaves safely during the entire episode, or whether it is able to reach a goal before the end of the episode.
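The dwell time in the added reset state is geometric: the chain stays with probability ϵ and leaves with probability $1 - \epsilon$, giving an expected dwell time of $\epsilon/(1 - \epsilon)$ (3 steps for $\epsilon = 3/4$, the value used in Appendix B.8). A minimal simulation sketch (the function name is ours):

```python
import random

def reset_dwell_time(eps, rng):
    """Number of consecutive steps spent in the explicit reset state:
    at each step the chain stays with probability eps and leaves
    (restarting the episode) with probability 1 - eps."""
    t = 0
    while rng.random() < eps:
        t += 1
    return t
```

Averaging many simulated dwell times for ϵ = 3/4 recovers the geometric expectation of 3 steps quoted in the hyperparameter section.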

B.3 ENVIRONMENTS WITH INITIAL DISTRIBUTION

Many environments do not have a single initial state but rather an initial distribution over states $d_I \in \Delta(\mathcal S)$. In that case, the results presented in this paper remain unchanged: it suffices to add a dummy state $s^\star$ to the state space, $\mathcal S \cup \{s^\star\}$, set $s_I = s^\star$, and define the transition dynamics as $P(s' \mid s^\star, a) = d_I(s')$ for any action $a \in \mathcal A$. Each time the reset of the environment is triggered, the MDP thus enters the initial state $s^\star$ and then transitions to $s'$ according to $d_I$.
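This construction is mechanical for explicit finite models; here is a small sketch on a toy transition tensor (the layout and names are our assumptions):

```python
import numpy as np

def add_initial_state(P, d_init):
    """Extend transition tensor P of shape (n, m, n) with a dummy state s*
    whose outgoing distribution is d_init for every action."""
    n, m, _ = P.shape
    P2 = np.zeros((n + 1, m, n + 1))
    P2[:n, :, :n] = P            # original dynamics unchanged
    P2[n, :, :n] = d_init        # P(s' | s*, a) = d_I(s') for all a
    return P2
```

The extended tensor remains row-stochastic, and the dummy state $s^\star$ never receives incoming probability mass from the original states unless the reset mechanism routes to it.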

B.4 LATENT SPACE DISTRIBUTION

As pointed out in Sect. 4, posterior collapse is naturally avoided when optimizing W²AE-MDPs. To illustrate this, we report the distribution of latent states produced by $\phi_\iota$ during training (Fig. 5). The plots reveal that, for each environment, the latent space generated by mapping original states drawn from $\xi_\pi$ to $\bar{\mathcal S}$ via $\phi_\iota$ is fairly distributed.

B.5 DISTANCE FUNCTIONS

The usual Euclidean distance is often a good choice for all the transition components, but the scale, dimensionality, and nature of the inputs sometimes require scaled, normalized, or other kinds of distances to allow the network to reconstruct each component. While we did not observe such requirements in our experiments (where we simply used the Euclidean distance), high-dimensional observations (e.g., images) are an example of data that could require tuning the state-distance function in this way, to ensure that the optimization of the reward or action reconstruction is not disfavored compared to that of the states.

B.6 VALUE DIFFERENCE

In addition to reporting the quality guarantees of the model along training steps through the local losses (cf. Figure 4b), our experiments revealed that the absolute value difference $\|V^{\pi_\theta}\|$ between the original and latent models operating under the latent policy quickly decreases and tends to converge to values in the same range (Figure 6). This is consistent with the fact that minimizing the local losses leads to close behaviors (cf. Eq. 1) and that the value function is Lipschitz-continuous w.r.t. the bisimulation pseudometric $\tilde d_{\pi_\theta}$ (cf. Section 2).

B.7 REMARK ON FORMAL VERIFICATION

Recall that our bisimulation guarantees come by construction of the latent space. Essentially, our learning algorithm produces a distilled policy and a latent state space that already yield a guaranteed bisimulation distance between the original MDP and the latent MDP. This is the crux of how we enable verification techniques such as model checking. In particular, the bisimulation guarantees mean that reachability probabilities in the latent MDP are close to those in the original one. Furthermore, the value difference of (ω-regular) properties (formulated through the μ-calculus) obtained in the two models is bounded by this distance (cf. Sect. 2 and Chatterjee et al. 2010). Reachability is the key ingredient for model checking MDPs. Model checking is in most cases performed by reduction to the reachability of components or regions of the MDP: it either consists of (i) iteratively checking the reachability of the parts of the state space satisfying the path formulae that comprise the specification, through a tree-like decomposition of the latter (e.g., for (P,R-)CTL properties, cf. Baier & Katoen 2008), or (ii) checking reachability in a product of the MDP with a memory structure or an automaton that embeds the ω-regular property, e.g., for LTL (Baier et al., 2016; Sickert et al., 2016), LTLf (Wells et al., 2020), or GLTL (Littman et al., 2017), among other specification formalisms. The choice of specification formalism is up to the user and depends on the case study. The scope of this work is learning to distill RL policies with bisimulation guarantees so that model checking can be applied to reason about the behaviors of the agent. That being said, reachability is all we need to show that model checking can be applied.

B.8 HYPERPARAMETERS

W²AE-MDP parameters. All components (e.g., functions or distribution locations and scales, see Fig. 2) are represented and inferred by neural networks (multilayer perceptrons). All the networks share the same architecture (i.e., number of layers and neurons per layer). We use a simple uniform experience replay of size $10^6$ to store the transitions and sample them. The training starts when the agent has collected $10^4$ transitions in $\mathcal M$. We used minibatches of size 128 to optimize the objective, and we applied a minibatch update every time the agent executing π performed 16 steps in $\mathcal M$. We use the recursive ϵ-perturbation trick of Huang (2020) with $\epsilon = 3/4$: when an episode ends, it restarts from the initial state with probability $1/4$; before restarting an episode, the time spent in the reset state labeled with reset thus follows the geometric distribution with expectation $\epsilon/(1 - \epsilon) = 3$. We chose the same latent state-action space sizes as Delgrange et al. (2022), except for LunarLander, which we decreased to $\log_2 |\bar{\mathcal S}| = 14$ and $|\bar{\mathcal A}| = 3$ to improve the scalability of the verification.

VAE-MDP parameters. For the comparison of Sect. 4, we used the exact same VAE-MDP hyperparameter set as prescribed by Delgrange et al. (2022), except for the state-action space of LunarLander, which we also changed for scalability and fair-comparison purposes.

Hyperparameter search. To evaluate our W²AE-MDP, we performed a search in the parameter space defined in Table 2. The best parameters found (in terms of the trade-off between performance and latent quality) are reported in Table 3. We used two different optimizers for minimizing the loss (referred to as the minimizer) and computing the Wasserstein terms (referred to as the maximizer). We used ADAM (Kingma & Ba, 2015) for both, but allow for different learning rates ADAM $\alpha$ and exponential decays ADAM $\beta_1$, ADAM $\beta_2$.
We also found that a polynomial decay of ADAM $\alpha$ (e.g., to $10^{-5}$ over $4 \cdot 10^5$ steps) is good practice to stabilize the learning curves, but it is not necessary to obtain high-quality, well-performing distillations. Concerning the continuous relaxation of discrete distributions, we used a different temperature for each distribution, as Maddison et al. (2017) pointed out that doing so is valuable to improve the results. We further followed the guidelines of Maddison et al. (2017) to choose the interval of temperatures and did not schedule any annealing scheme (in contrast to VAE-MDPs). Essentially, the search reveals that the regularizer scale factors β (defining the optimization direction) as well as the encoder and latent transition temperatures are important for improving the performance of the distilled policies. For the encoder temperature, we found a sweet spot at $\lambda_{\phi_\iota} = 2/3$, which provides the best performance in general, whereas the choices of $\lambda_{P_\theta}$ and the β factors are (latent-)environment dependent. The importance of the temperature parameters for the continuous relaxation of discrete distributions is consistent with the results of Maddison et al. (2017), revealing that the success of the relaxation depends on the choice of temperature for the different latent space sizes.

Labeling functions. We used the same labeling functions as those described by Delgrange et al. (2022). For completeness, we recall the labeling function used for each environment in Table 4.
The temperature search space (Table 2) is $\lambda_{P_\theta}, \lambda_{\phi_\iota} \in \{0.1, 1/3, 1/2, 2/3, 3/5, 0.99\}$ and $\lambda_{\pi_\theta}, \lambda_{\phi^{\mathcal A}_\iota} \in \{1/(|\bar{\mathcal A}| - 1),\ 1.5/(|\bar{\mathcal A}| - 1)\}$; the best temperatures found per environment are listed in Table 3.

For Acrobot, let $\theta_1, \theta_2 \in [0, 2\pi]$ be the angles of the two rotational joints; the observation is $s_1 = \cos\theta_1$, $s_2 = \sin\theta_1$, $s_3 = \cos\theta_2$, $s_4 = \sin\theta_2$, and $s_5, s_6$ are the two angular velocities. The atomic propositions are $p_1 = \mathbf 1_{-s_1 - (s_3 s_1 - s_4 s_2) > 1}$ (the RL agent's target, i.e., $-\cos\theta_1 - \cos(\theta_1 + \theta_2) > 1$), $p_2 = \mathbf 1_{s_1 \ge 0}$ ($\theta_1 \in [0, \pi/2] \cup [3\pi/2, 2\pi]$), $p_3 = \mathbf 1_{s_2 \ge 0}$ ($\theta_1 \in [0, \pi]$), and $p_4 = \mathbf 1_{s_3 \ge 0}$ ($\theta_2 \in [0, \pi/2] \cup [3\pi/2, 2\pi]$).

• LunarLander: $\varphi = \lnot\mathrm{SafeLanding}\ \mathrm{U}\ \mathrm{Reset}$, where $\mathrm{SafeLanding} = \mathrm{GroundContact} \wedge \mathrm{MotorsOff}$, $\mathrm{GroundContact} = \langle 0, 1, 0, 0, 0, 0, 0\rangle$, and $\mathrm{MotorsOff} = \langle 0, 0, 0, 0, 0, 1, 0\rangle$.
• Pendulum: $\varphi = \Diamond(\lnot\mathrm{Safe} \wedge \bigcirc\mathrm{Reset})$, where $\mathrm{Safe} = \langle 1, 0, 0, 0, 0\rangle$, $\Diamond T = \lnot\varnothing\ \mathrm{U}\ T$, and $s_i \models \bigcirc T$ iff $s_{i+1} \models T$, for any $T \subseteq AP$ and $s_{i:\infty}, a_{i:\infty} \in \mathrm{Traj}$. Intuitively, φ denotes the event of ending an episode in an unsafe state, just before resetting the environment, which means that the agent either never reached the safe region or reached and left it at some point. Formally, $\varphi = \{\, s_{0:\infty}, a_{0:\infty} \mid \exists i \in \mathbb N,\ s_i \models \lnot\mathrm{Safe} \wedge s_{i+1} \models \mathrm{Reset} \,\} \subseteq \mathrm{Traj}$.

C ON THE CURSE OF VARIATIONAL MODELING

Posterior collapse is a well known issue occurring in variational models (see, e.g., Alemi et al. 2018; Tolstikhin et al. 2018; He et al. 2019; Dong et al. 2020) which intuitively results in a degenerate local optimum where the model learns to ignore the latent space and use only the reconstruction functions (i.e., the decoding distribution) to optimize the objective. VAE-MDPs are no exception, as pointed out in the original paper (Delgrange et al., 2022, Section 4.3 and Appendix C.2) . function for learning the latent space model -the so-called evidence lower bound (Hoffman et al., 2013; Kingma & Welling, 2014) , or ELBO for short -and set up annealing schemes to eventually recover the ELBO at the end of the training process. Consequently, the resulting learning procedure focuses primarily on fairly distributing the latent space, to avoid it to collapse to a single latent state, to the detriment of learning the dynamics of the environment and the distillation of the RL policy. Then, the annealing scheme allows to make the model learn to finally smoothly use the latent space to maximize the ELBO, and achieve consequently a lower distortion at the "price" of a higher rate. Impact of the resulting learning procedure. The aforementioned annealing process, used to avoid that every state collapses to the same representation, possibly induces a high entropy embedding function (Fig. 7d ), which further complicates the learning of the model dynamics and the distillation in the first stage of the training process. In fact, in this particular case, one can observe that the entropy reaches its maximal value, which yields a fully random state embedding function. Recall that the VAE-MDP latent space is learned through independent Bernoulli distributions. Fig. 
7d reports values centered around 4.188 in the first training phase, which corresponds to the entropy of the state embedding function when ϕ ι p¨| sq is uniformly distributed over S for any state s P S: Hpϕ ι p¨| sqq " řlog 2 |S|´|AP|"6 i"0 ´pi log p i ´p1 ´pi q logp1 ´pi q " 4.188, where p i " 1 {2 for all i. The rate (Fig. 7b ) drops to zero since the divergence pulls the latent dynamics towards this high entropy (yet another form of posterior collapse), which hinders the latent space model to learn a useful representation. However, the annealing scheme increases the rate importance along training steps, which enables the optimization to eventually leave this local optimum (here around 4 ¨10 5 training steps). This allows the learning procedure to leave the zero-rate spot, reduce the distortion (Fig. 7c ), and finally distill the original policy (Fig. 7e ). As a result, the whole engineering required to mitigate posterior collapse slows down the training procedure. This phenomenon is reflected in Fig. 4 : VAE-MDPs need several steps to stabilize and set up the stage to the concrete optimization, whereas WAE-MDPs have no such requirements since they naturally do not suffer from collapsing issues (cf. Fig. 5 ), and are consequently faster to train. Lack of representation guarantees. On the theoretical side, since VAE-MDPs are optimized via the ELBO and the local losses via the related variational proxies, VAE-MDPs do not leverage the representation quality guarantees induced by local losses (Eq. 1) during the learning procedure (as explicitly pointed out by Delgrange et al., 2022, Sect. 4.1.) 
: in contrast to WAE-MDPs, when two original states are embedded into the same latent (abstract) state, they are not guaranteed to be bisimilarly close (i.e., the agent is not guaranteed to behave the same way from those two states when executing the policy). In other words, the variational proxies do not prevent original states with distant values from collapsing to the same latent representation.
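To make this representation-quality property concrete, it can be probed empirically on a finite sample of states: states embedded into the same latent state should have close values. The following is an illustrative sketch only; the embedding `phi` and value function `V` below are hypothetical stand-ins, not the learned WAE-MDP components.

```python
from collections import defaultdict

# Illustrative probe of representation quality: group sampled states by
# their latent embedding and measure the largest value gap within each
# group. Large gaps indicate that states with distant values were
# collapsed to the same latent representation.
def max_value_gap_per_latent(states, phi, V):
    """Map each latent state to max |V(s) - V(s')| over states embedded into it."""
    groups = defaultdict(list)
    for s in states:
        groups[phi(s)].append(V(s))
    return {z: max(vs) - min(vs) for z, vs in groups.items()}

# Hypothetical 1-bit embedding and value function on a toy state space:
gaps = max_value_gap_per_latent(
    states=[0.0, 0.1, 0.9, 1.0],
    phi=lambda s: int(s > 0.5),
    V=lambda s: s,
)
```

Here both latent classes exhibit a value gap of 0.1; a representation guarantee bounds such gaps for states collapsed together.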



The code is available at https://github.com/florentdelgrange/wae_mdp. The code for conducting the VAE-MDP experiments is available at https://github.com/florentdelgrange/vae_mdp (GNU General Public License v3.0). In fact, the phenomenon of collapsing to a few states occurs for all the environments considered in this paper when their prioritized experience replay is not used, as illustrated in Delgrange et al., 2022, Appendix C.2.



(a) Execution of the latent policy π̄ in the original and latent MDPs, and local losses. (b) Parallel execution of the original RL policy π in the original and latent MDPs, local losses, and steady-state regularizer.

Figure 1: Latent flows: arrows represent (stochastic) mappings, the original (resp. latent) state-action space is spread along the blue (resp. green) area, and distances are depicted in red. Distilling π into π̄ via flow (b) by minimizing W_{ξπ} allows closing the gap between flows (a) and (b).

Figure 2: W²AE-MDP architecture. Distances are depicted by red dotted lines.

Figure 3: Markov Chain with four states; labels are drawn next to their state.

Figure 4: For each environment, we trained five different instances of the models with different random seeds: the solid line is the median and the shaded interval the interquartile range.
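The aggregation described in this caption can be restated precisely; the following minimal, standard-library-only sketch computes the median and interquartile range of a metric across the per-seed runs at a given training step (an illustrative recomputation, not the plotting code used for the figure).

```python
import statistics

# Median and interquartile range across per-seed measurements, as used
# to aggregate the five training runs (illustrative recomputation).
def median_and_iqr(values):
    """Return (median, (q1, q3)) for a list of per-seed measurements."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), (q1, q3)

med, (q1, q3) = median_and_iqr([1.0, 2.0, 3.0, 4.0, 5.0])
```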

Let d, d_λ be as defined above. Then (i) d_λ → 1_{≠} as λ → 0, and (ii) d and d_λ are Lipschitz-equivalent.

Proof. Part (i) is straightforward by definition of d_λ. Distances d and d_λ are Lipschitz-equivalent if and only if there exist a, b > 0 such that, for all x, y ∈ [0, 1]ⁿ,

a · d(x, y) ≤ d_λ(x, y) ≤ b · d(x, y)
⟺ a · d(x, y) ≤ d(x, y) / (λ + d(x, y)) ≤ b · d(x, y)
⟺ a ≤ 1 / (λ + d(x, y)) ≤ b (for x ≠ y; the case x = y is trivial),

which holds by taking a = 1/(λ + D) and b = 1/λ, where D = sup_{x,y ∈ [0,1]ⁿ} d(x, y) < ∞.
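The inequality chain above can be sanity-checked numerically. The sketch below assumes d_λ(x, y) = d(x, y)/(λ + d(x, y)) with d the Euclidean metric on [0, 1]ⁿ, as the middle step of the proof suggests, and verifies a·d ≤ d_λ ≤ b·d for the candidate constants a = 1/(λ + √n) and b = 1/λ.

```python
import math
import random

def euclid(x, y):
    """Euclidean distance on [0, 1]^n."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def d_lambda(x, y, lam):
    """Smoothed distance d/(lam + d), assumed from the proof's inequality chain."""
    d = euclid(x, y)
    return d / (lam + d)

random.seed(0)
n, lam = 4, 0.1
a, b = 1.0 / (lam + math.sqrt(n)), 1.0 / lam  # Lipschitz-equivalence constants
for _ in range(1000):
    x = [random.random() for _ in range(n)]
    y = [random.random() for _ in range(n)]
    d = euclid(x, y)
    # a * d <= d_lambda(x, y) <= b * d (small tolerance for floating point)
    assert a * d <= d_lambda(x, y, lam) + 1e-12
    assert d_lambda(x, y, lam) <= b * d + 1e-12
```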

Figure 5: Latent space distribution along training steps. The intensity of the blue hue corresponds to the frequency of latent states produced by ϕ_ι during training.

Figure 6: Absolute value difference ∥V^π_θ∥ reported along training steps.

Table 1: Formal verification of distilled policies. Values are computed for γ = 0.99 (lower is better).
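Since the latent model is explicit and finite, a time-to-failure check reduces to standard value iteration. The sketch below computes the γ-discounted probability of eventually reaching a designated failure (reset) state on a small latent chain; the 4-state transition matrix is hypothetical, not the chain of Fig. 3 nor one of the models verified in Sect. 4.

```python
# Discounted reachability of a failure state via value iteration on an
# explicit latent Markov chain (illustrative; the chain is hypothetical).
def discounted_reach(P, failure, gamma=0.99, iters=10_000):
    """V(s) = 1 if s is a failure state, else gamma * sum_t P[s][t] * V(t)."""
    n = len(P)
    V = [0.0] * n
    for _ in range(iters):
        V = [
            1.0 if s in failure
            else gamma * sum(P[s][t] * V[t] for t in range(n))
            for s in range(n)
        ]
    return V

P = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.5, 0.4, 0.1],
    [0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.0, 1.0],  # absorbing failure state
]
values = discounted_reach(P, failure={3})
```

The closer a state's value is to 1, the sooner failure is reached in expectation, which matches the "lower is better" reading of the table.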


Hyperparameter search. λ_X refers to the temperature used for the W²AE-MDP component X; "use ε-mimic" (cf. Delgrange et al. 2022) ranges over {True, False} (if True, a decay rate of 10⁻⁵ is used).

Final hyperparameters used to evaluate W²AE-MDPs in Sect. 4.

Description, for s ∈ S: ℓ(s) = ⟨p₁, . . . , p_n, p_reset⟩

• p₁ = 1[s₁ ≥ cos(π/3)]: safe joint angle
• p₂ = 1[s₁ ≥ 0]: θ ∈ [0, π/2] ∪ [3π/2, 2π]
• p₃ = 1[s₂ ≥ 0]: θ ∈ [0, π]
• p₄ = 1[s₃ ≥ 0]: positive angular velocity
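Such a labeling function can be written directly from the indicators above. The sketch below is illustrative: the state layout (s₁, s₂, s₃) and the explicit `reset` flag are simplifying assumptions, not the exact environment interface.

```python
import math

# Sketch of a labeling function in the style of Table 4: each atomic
# proposition is an indicator over state components, and the label is
# the binary tuple <p_1, ..., p_n, p_reset>. The state layout and the
# explicit reset flag are simplifying assumptions.
def label(s, reset: bool):
    """Return l(s) = (p_1, p_2, p_3, p_4, p_reset) for s = (s1, s2, s3)."""
    s1, s2, s3 = s
    return (
        int(s1 >= math.cos(math.pi / 3)),  # p_1: safe joint angle
        int(s1 >= 0),                      # p_2: theta in [0, pi/2] U [3pi/2, 2pi]
        int(s2 >= 0),                      # p_3: theta in [0, pi]
        int(s3 >= 0),                      # p_4: positive angular velocity
        int(reset),                        # p_reset: reset-state bit
    )
```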

Labeling functions for the OpenAI environments considered in our experiments (Delgrange et al., 2022). We provide a short description of the state space and the meaning of each atomic proposition. Recall that labels are binary encoded, for n = |AP| − 1 (one bit is reserved for reset), and p_reset = 1 iff s is a reset state (cf. Appendix B.2). Based on the labeling described in Table 4, we formally detail the time-to-failure properties checked in Sect. 4, whose results are listed in Table 1 for each environment. Let Reset = {reset} = ⟨0, . . . , 1⟩ (we assume here that the last bit indicates whether the current state is a reset state or not) and define s ⊨ L

• 1[cond]: indicator function; 1 if the statement [cond] is true, and 0 otherwise.
• F_d: set of 1-Lipschitz functions w.r.t. the distance metric d.
• σ: sigmoid function, with σ(x) = 1/(1 + exp(−x)).
• f_θ: a function f_θ : X → ℝ modeled by a neural network parameterized by θ, where X is any measurable set.

Latent Space Model
• M̄ = ⟨S̄, Ā, P̄, R̄, ℓ̄, AP, s̄_I⟩: latent MDP with state space S̄, action space Ā, reward function R̄, labeling function ℓ̄, atomic proposition space AP, and initial state s̄_I.
• Action embedding function, from S̄ × A to Ā.
• ϕP: distribution of drawing s′ ∼ P(· | s, a), then embedding s̄′ = ϕ(s′), for any state s ∈ S and action a ∈ A.

MDP
• M = ⟨S, A, P, R, ℓ, AP, s_I⟩: MDP with state space S, action space A, transition function P, reward function R, labeling function ℓ, atomic proposition space AP, and initial state s_I.
• ξ^t_π(s′ | s) = P^{M_s}_π({⟨s_{0:∞}, a_{0:∞}⟩ | s_t = s′}): limiting distribution of the MDP, for any source state s ∈ S.
• Π: set of memoryless policies of M.
• π: memoryless policy π : S → ∆(A).
• P^M_π: unique probability measure induced by the policy π in M on the Borel σ-algebra over measurable subsets of Traj.
• C U T: constrained reachability event.
• M_s: MDP obtained by replacing the initial state of M by s ∈ S.
• s̄: state in S̄.
• ξ_π: stationary distribution of M induced by the policy π.
• ⃗d: raw transition distance, i.e., metric over S × A × Im(R) × S.
• Traj: set of infinite trajectories of M.
• τ = ⟨s_{0:T}, a_{0:T−1}⟩: trajectory.
• V_π: value function for the policy π.

Probability / Measure Theory
• D: discrepancy measure; D(P, Q) is the discrepancy between distributions P, Q ∈ ∆(X).
• ∆(X): set of measures over a complete, separable metric space X.
• Logistic(µ, s): logistic distribution with location parameter µ and scale parameter s.
• W_d: Wasserstein distance w.r.t. the metric d; W_d(P, Q) is the Wasserstein distance between distributions P, Q ∈ ∆(X).

Wasserstein Auto-encoded MDP
• ξ: marginal encoding distribution over S̄ × Ā × S̄, i.e., E_{s,a,s′ ∼ ξ_π} ϕ_ι(· | s, a, s′).
• ξ_π^θ: stationary distribution of the latent model M̄_θ, parameterized by θ.
• φ: distribution of drawing state-action pairs from interacting with M, embedding them into the latent spaces, and finally letting them transition to their successor state in M̄_θ; an element of ∆(S̄ × Ā × S̄).

ACKNOWLEDGMENTS

This research received funding from the Flemish Government (AI Research Program) and was supported by the DESCARTES iBOF project. G.A. Perez is also supported by the Belgian FWO "SAILor" project (G030020N). We thank Raphael Avalos for his valuable feedback during the preparation of this manuscript.

APPENDIX

While the former clearly fails to learn a useful latent representation, the latter does so meticulously and smoothly in two distinguishable phases: first, ϕ_ι focuses on fairly distributing the latent space, setting the stage for the concrete optimization occurring from step 4·10⁵, where the entropy of ϕ_ι is lowered, which moves the rate of the variational model away from zero. Five instances of the models are trained with different random seeds, with the same hyperparameters as in Sect. 4. Formally, VAE- and WAE-MDPs optimize their objective by minimizing two losses: a reconstruction cost plus a regularizer term which penalizes a discrepancy between the encoding distribution and the dynamics of the latent space model. In VAE-MDPs, the former corresponds to the distortion and the latter to the rate of the variational model (further details are given in Alemi et al. 2018; Delgrange et al. 2022), while in our WAE-MDPs, the former corresponds to the raw transition distance and the latter to both the steady-state and transition regularizers. Notably, the rate minimization of VAE-MDPs involves regularizing a stochastic embedding function ϕ_ι(· | s) point-wise, i.e., for every input state s ∈ S drawn from the interaction with the original environment. In contrast, the latent space regularization of the WAE-MDP involves the marginal embedding distribution Q_ι, where the embedding function ϕ_ι is not required to be stochastic. Alemi et al. (2018) showed that posterior collapse occurs in VAEs when the rate of the variational model is close to zero, leading to a low-quality representation. Posterior collapse in VAE-MDPs. We illustrate the sensitivity of VAE-MDPs to the posterior collapse problem in Fig. 7, through the CartPole environment: minimizing the distortion and the rate as is yields an embedding function which deterministically maps every input state to the same sink latent state (cf. Fig. 7a).
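The maximal-entropy regime mentioned above is easy to quantify: since the latent space is learned through independent Bernoulli bits, the embedding entropy is the sum of per-bit entropies, maximized when every p_i = 1/2. A minimal sketch (entropies in nats; the number of bits below is illustrative):

```python
import math

# Entropy of a product of independent Bernoulli(p_i) latent bits: the
# embedding entropy is maximal (n * ln 2 nats) when every p_i = 1/2,
# i.e., when the state embedding is fully random.
def bernoulli_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def embedding_entropy(ps) -> float:
    """Entropy of independent Bernoulli bits with parameters ps."""
    return sum(bernoulli_entropy(p) for p in ps)

n = 6  # illustrative number of latent bits
h_uniform = embedding_entropy([0.5] * n)   # maximal: n * ln 2
h_peaked = embedding_entropy([0.999] * n)  # near-deterministic embedding
```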
Precisely, there is a latent state s̄ ∈ S̄ such that ϕ_ι(s̄ | s) ≈ 1 and P̄_θ(s̄ | s̄, a) ≈ 1 for every state s ∈ S and action a ∈ A. This is a form of posterior collapse: the resulting rate quickly drops to zero (cf. Fig. 7b), and the resulting latent representation yields no information at all. This phenomenon is handled in VAE-MDPs by (i) using prioritized replay buffers that allow focusing on inputs that led to a poor representation, and (ii) modifying the objective function for learning the latent space model.
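A common way to implement the objective modification in (ii) is to scale the regularizer (rate) term by an annealed coefficient that grows from 0 to 1, so that the plain ELBO is recovered at the end of training. The schedule below is a generic linear sketch, not the exact scheme of Delgrange et al. (2022); the hyperparameter names are illustrative.

```python
# Generic KL-weight annealing sketch: the rate term of the ELBO is
# down-weighted early in training to avoid posterior collapse, and the
# plain ELBO (beta = 1) is recovered after `warmup_steps`.
def kl_anneal(step: int, warmup_steps: int = 100_000) -> float:
    """Linearly anneal the regularizer coefficient from 0 to 1."""
    return min(1.0, step / warmup_steps)

def annealed_objective(distortion: float, rate: float, step: int) -> float:
    """Penalized objective: distortion + beta(step) * rate."""
    return distortion + kl_anneal(step) * rate
```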

