WASSERSTEIN AUTO-ENCODED MDPS: FORMAL VERIFICATION OF EFFICIENTLY DISTILLED RL POLICIES WITH MANY-SIDED GUARANTEES

Abstract

Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov decision processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that resolves these issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation quality of the model. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments with a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.

1. INTRODUCTION

Reinforcement learning (RL) is emerging as a solution of choice to address challenging real-world scenarios such as epidemic mitigation and prevention strategies (Libin et al., 2020), multi-energy management (Ceusters et al., 2021), or effective canal control (Ren et al., 2021). RL enables learning high-performance controllers by introducing general nonlinear function approximators (such as neural networks) to scale with high-dimensional and continuous state-action spaces. This introduction, termed deep RL, causes the loss of the conventional convergence guarantees of RL (Tsitsiklis, 1994) as well as those obtained in some continuous settings (Nowe, 1994), and hinders the wide roll-out of such controllers in critical settings. This work enables the formal verification of any such policy, learned by agents interacting with unknown, continuous environments modeled as Markov decision processes (MDPs). Specifically, we learn a discrete representation of the state-action space of the MDP, which yields both a (smaller, explicit) latent space model and a distilled version of the RL policy, both tractable for model checking (Baier & Katoen, 2008). These artifacts are supported by bisimulation guarantees: intuitively, the agent behaves similarly in the original and latent models. The strength of our approach is not simply that we verify whether the RL agent meets a predefined set of specifications, but rather that we provide an abstract model on which the user can reason and check any desired agent property. Variational MDPs (VAE-MDPs, Delgrange et al. 2022) offer a valuable framework for doing so. The distillation is provided with PAC-verifiable bisimulation bounds guaranteeing that the agent behaves similarly (i) in the original and latent model (abstraction quality); (ii) from all original states mapped to the same discrete state (representation quality).
Whilst the bounds offer a confidence metric that enables the verification of performance and safety properties, VAE-MDPs suffer from several learning flaws. First, training a VAE-MDP relies on variational proxies to the bisimulation bounds, so optimizing the training objective yields no direct guarantee on the quality of the latent model. Second, variational autoencoders (VAEs) (Kingma & Welling, 2014; Hoffman et al., 2013) are known to suffer from posterior collapse (e.g., Alemi et al. 2018), which in VAE-MDPs results in a deterministic mapping to a single latent state. Most of the training process focuses on handling this phenomenon and setting the stage for the concrete distillation and abstraction, which only take place in a second training phase. This requires extra regularizers, annealing schemes, distinct learning phases, and prioritized replay buffers to store transitions. Distillation through VAE-MDPs is thus a meticulous task, requiring a large step budget and tuning many hyperparameters. Building upon Wasserstein autoencoders (Tolstikhin et al., 2018) instead of VAEs, we introduce Wasserstein auto-encoded MDPs (WAE-MDPs), which overcome those limitations. Our WAE relies on the optimal transport (OT) from the trace distribution resulting from the execution of the RL policy in the real environment to that reconstructed from the latent model operating under the distilled policy. In contrast to VAEs, which rely on variational proxies, we derive a novel objective that directly incorporates the bisimulation bounds. Furthermore, while VAEs learn stochastic mappings to the latent space which must be determinized, or even entirely reconstructed from data, at deployment time to obtain the guarantees, our WAE learns all the components required for the guarantees during training and needs no such post-processing.
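To make the shape of such a penalized objective concrete, recall the plain Wasserstein auto-encoder of Tolstikhin et al. (2018), which minimizes a reconstruction cost plus a weighted latent discrepancy; the WAE-MDP objective adapts this template to trace distributions and bisimulation-based regularizers (the symbols below follow the WAE paper, not our latent MDP notation):

```latex
\inf_{Q(Z \mid X)} \;
\mathbb{E}_{X \sim P_X}\, \mathbb{E}_{Z \sim Q(Z \mid X)}
\big[\, c\big(X, G(Z)\big) \,\big]
\;+\; \lambda \, \mathcal{D}_Z\big(Q_Z, P_Z\big)
```

where c is a ground cost, G the decoder, Q_Z the aggregated posterior over latent codes, P_Z the latent prior, and D_Z an arbitrary divergence penalty with weight λ > 0.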
Those theoretical claims are reflected in our experiments: policies are distilled up to 10 times faster through WAE-MDPs than through VAE-MDPs, and provide better abstraction quality and performance in general, without annealing schemes, multiple training phases, prioritized buffers, or extra regularizers. Our distilled policies are able to recover (and sometimes even outperform) the original policy performance, highlighting the representation quality offered by our new framework: the distillation is able to remove some non-robustness of the input RL policy. Finally, we formally verified time-to-failure properties (e.g., Pnueli 1977) to emphasize the applicability of our approach.

Other Related Work. Complementary works approach safe RL via formal methods (Junges et al., 2016; Alshiekh et al., 2018; Jansen et al., 2020; Simão et al., 2021), aimed at formally ensuring safety during RL; all of them require providing an abstract model of the safety aspects of the environment. They also include the work of Alamdari et al. (2020), which applies synthesis and model checking to policies distilled from RL, without quality guarantees. Other frameworks share our goal of verifying deep-RL policies (Bacci & Parker, 2020; Carr et al., 2020) but rely on a known environment model, among other assumptions (e.g., a deterministic or discrete environment). Finally, DeepSynth (Hasanbeig et al., 2021) allows learning a formal model from execution traces, with the different purpose of guiding the agent towards sparse and non-Markovian rewards. On the latent space training side, WWAEs (Zhang et al., 2019) reuse OT as the latent regularization discrepancy (in Gaussian closed form), whereas we derive two regularizers involving OT, which are, in contrast, optimized via the dual formulation of the Wasserstein distance, as in Wasserstein-GANs (Arjovsky et al., 2017).
Similarly to VQ-VAEs (van den Oord et al., 2017) and Latent Bernoulli AEs (Fajtl et al., 2020), our latent space model learns discrete spaces via deterministic encoders, but relies on a smooth approximation instead of the straight-through gradient estimator. Works on representation learning for RL (Gelada et al., 2019; Castro et al., 2021; Zhang et al., 2021; Zang et al., 2022) consider bisimulation metrics to optimize the representation quality, aiming at (continuous) representations that capture bisimulation, so that states close in the representation are guaranteed to provide similar, relevant information for optimizing the performance of the controller. In particular, as in our work, DeepMDPs (Gelada et al., 2019) are learned by optimizing local losses, but assume a deterministic MDP and offer no verifiable confidence measurement.
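To illustrate the distinction, the sketch below contrasts a hard 0/1 encoding with a temperature-controlled sigmoid relaxation, one common way to smoothly approximate discrete latent codes (the function names and the specific relaxation are illustrative assumptions, not the exact construction used in this paper):

```python
import numpy as np

def smooth_binarize(logits, temperature):
    """Smooth approximation of a {0,1} step function: a tempered sigmoid.

    For any temperature > 0 the output is differentiable, so gradients flow
    through the encoder without a straight-through estimator; as the
    temperature approaches 0, the output approaches a hard 0/1 encoding.
    """
    return 1.0 / (1.0 + np.exp(-logits / temperature))

logits = np.array([-2.0, -0.1, 0.3, 4.0])   # encoder outputs for 4 latent bits
soft = smooth_binarize(logits, temperature=0.5)    # smooth, trainable codes
hard = smooth_binarize(logits, temperature=0.01)   # nearly discrete codes
```

In contrast, the straight-through estimator would round `soft` in the forward pass and pretend the rounding was the identity in the backward pass, which biases the gradient; the smooth relaxation avoids that mismatch at the cost of a temperature schedule.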

2. BACKGROUND

In the following, we write Δ(X) for the set of measures over the (complete, separable metric space) X. Markov decision processes (MDPs) are tuples M = ⟨S, A, P, R, ℓ, AP, s_I⟩ where S is a set of states; A, a set of actions; P : S × A → Δ(S), a probability transition function that maps the current state and action to a distribution over the next states; R : S × A → ℝ, a reward function; ℓ : S → 2^AP, a labeling function over a set of atomic propositions AP; and s_I ∈ S, the initial state. If |A| = 1, M is a fully stochastic process called a Markov chain (MC). We write M_s for the MDP obtained when replacing the initial state of M by s ∈ S. An agent interacting in M produces trajectories, i.e., sequences of states and actions τ = ⟨s_{0:T}, a_{0:T−1}⟩ where s_0 = s_I and s_{t+1} ∼ P(· | s_t, a_t) for t < T. The set of infinite trajectories of M is Traj. We assume AP and
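The tuple above can be sketched as a toy finite MDP with trajectory sampling (a minimal illustration with names of our own choosing, not part of the paper's formalism):

```python
import random

# A toy finite MDP M = <S, A, P, R, l, AP, s_I> matching the tuple above.
S = ["s0", "s1", "fail"]
A = ["a"]
P = {  # P : S x A -> Delta(S), given here as explicit distributions
    ("s0", "a"): {"s1": 0.9, "fail": 0.1},
    ("s1", "a"): {"s0": 1.0},
    ("fail", "a"): {"fail": 1.0},
}
R = {("s0", "a"): 1.0, ("s1", "a"): 0.0, ("fail", "a"): -1.0}
label = {"s0": set(), "s1": set(), "fail": {"unsafe"}}  # l : S -> 2^AP
s_I = "s0"

def sample_trajectory(T, seed=0):
    """Draw tau = <s_0:T, a_0:T-1> with s_0 = s_I and s_{t+1} ~ P(. | s_t, a_t)."""
    rng = random.Random(seed)
    states, actions = [s_I], []
    for _ in range(T):
        s, a = states[-1], "a"  # |A| = 1, so M is effectively a Markov chain
        nxt = rng.choices(list(P[(s, a)]), weights=list(P[(s, a)].values()))[0]
        actions.append(a)
        states.append(nxt)
    return states, actions
```

Once such a finite latent model is extracted, properties like the time-to-failure of reaching a state labeled `unsafe` can be checked exactly on P rather than estimated from rollouts.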

