WASSERSTEIN AUTO-ENCODED MDPS: FORMAL VERIFICATION OF EFFICIENTLY DISTILLED RL POLICIES WITH MANY-SIDED GUARANTEES

Abstract

Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov decision processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes these issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the quality of the abstraction and representation model. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed generally better. Moreover, we present experiments with a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.
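For context, the optimal transport quantity invoked above can be sketched via the standard Kantorovich formulation of the Wasserstein distance (the precise penalized objective used for WAE-MDP training is a variant of this basic form): for distributions P and Q over a space equipped with a distance d,

\[
W(P, Q) \;=\; \inf_{\gamma \in \Gamma(P, Q)} \; \mathbb{E}_{(x, y) \sim \gamma}\!\left[ d(x, y) \right],
\]

where \(\Gamma(P, Q)\) denotes the set of couplings of P and Q, i.e., joint distributions whose marginals are P and Q. Intuitively, minimizing this distance between the behavior distributions induced by the original and distilled policies forces the two agents to be hard to tell apart under the ground metric d.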

1. INTRODUCTION

Reinforcement learning (RL) is emerging as a solution of choice to address challenging real-world scenarios such as epidemic mitigation and prevention strategies (Libin et al., 2020), multi-energy management (Ceusters et al., 2021), or effective canal control (Ren et al., 2021). RL enables learning high-performance controllers by introducing general nonlinear function approximators (such as neural networks) to scale with high-dimensional and continuous state-action spaces. This introduction, termed deep RL, causes the loss of the conventional convergence guarantees of RL (Tsitsiklis, 1994), as well as those obtained in some continuous settings (Nowe, 1994), and hinders the wide roll-out of such policies in critical settings. This work enables the formal verification of any such policy, learned by agents interacting with unknown, continuous environments modeled as Markov decision processes (MDPs). Specifically, we learn a discrete representation of the state-action space of the MDP, which yields both a (smaller, explicit) latent space model and a distilled version of the RL policy, both of which are tractable for model checking (Baier & Katoen, 2008). The latter are supported by bisimulation guarantees: intuitively, the agent behaves similarly in the original and latent models. The strength of our approach is not simply that we verify that the RL agent meets a predefined set of specifications, but rather that we provide an abstract model on which the user can reason and check any desired agent property. Variational MDPs (VAE-MDPs, Delgrange et al. 2022) offer a valuable framework for doing so. The distillation is provided with PAC-verifiable bisimulation bounds guaranteeing that the agent behaves similarly (i) in the original and latent model (abstraction quality); and (ii) from all original states embedded into the same discrete state (representation quality).
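To make the distillation pipeline concrete, the following is a minimal illustrative sketch, not the paper's actual architecture: a (here, fixed rather than learned) encoder maps continuous states to one of finitely many discrete latent states, and the distilled policy is then simply a lookup table over those latent states, small enough for explicit model checking. All names (DiscreteEncoder, distilled_policy) are hypothetical.

```python
import numpy as np

class DiscreteEncoder:
    """Maps continuous states to one of n_latent discrete latent states.

    Uses a fixed random linear projection followed by binning, as a
    stand-in for the learned state embedding of a latent space model.
    """

    def __init__(self, state_dim, n_latent, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=state_dim)
        self.n_latent = n_latent

    def embed(self, state):
        # Project to a scalar in (0, 1), then bin into {0, ..., n_latent - 1}.
        score = 1.0 / (1.0 + np.exp(-self.w @ state))
        return min(int(score * self.n_latent), self.n_latent - 1)


encoder = DiscreteEncoder(state_dim=4, n_latent=8)

# The distilled policy is an explicit table: one action per latent state
# (toy deterministic choice here; it could equally be a distribution).
distilled_policy = {z: z % 2 for z in range(8)}

state = np.array([0.1, -0.3, 0.5, 0.0])
z = encoder.embed(state)      # discrete latent state
action = distilled_policy[z]  # action of the distilled policy
```

Because both the latent transition structure and the policy table are finite and explicit, standard probabilistic model-checking tools can be applied directly to them, which is what the bisimulation bounds then relate back to the original continuous-state agent.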
Whilst the bounds offer a confidence metric that enables the verification of performance and safety properties, VAE-MDPs suffer from several learning flaws. First, training a VAE-MDP relies on variational proxies to the bisimulation bounds, meaning that optimizing the proxies yields no learning guarantee on the quality of the latent model. Second, variational autoencoders (VAEs) (Kingma & Welling, 2014; Hoffman et al., 2013) are known

