SAFE REINFORCEMENT LEARNING FROM PIXELS USING A STOCHASTIC LATENT REPRESENTATION

Abstract

We address the problem of safe reinforcement learning from pixel observations. Inherent challenges in such settings are (1) a trade-off between reward optimization and adhering to safety constraints, (2) partial observability, and (3) high-dimensional observations. We formalize the problem in a constrained, partially observable Markov decision process framework, where an agent obtains distinct reward and safety signals. To address the curse of dimensionality, we employ a novel safety critic using the stochastic latent actor-critic (SLAC) approach. The latent variable model predicts rewards and safety violations, and we use the safety critic to train safe policies. Using well-known benchmark environments, we demonstrate competitive performance over existing approaches in terms of computational requirements, final reward return, and satisfaction of the safety constraints.

1. INTRODUCTION

As reinforcement learning (RL) algorithms are increasingly applied in the real world (Mnih et al., 2015; Jumper et al., 2021; Fu et al., 2021), their safety becomes ever more important as both model complexity and uncertainty increase. Considerable effort has been devoted to increasing the safety of RL (Liu et al., 2021). However, major challenges remain that prevent the deployment of RL in the real world (Dulac-Arnold et al., 2021). Most approaches to safe RL are limited to fully observable settings, neglecting issues such as noisy or imprecise sensors. Moreover, realistic environments exhibit high-dimensional observation spaces and are largely out of reach for the state-of-the-art. In this work, we present an effective safe RL approach that handles partial observability with high-dimensional observation spaces in the form of pixel observations.

In line with prior work, we formalize the safety requirements using a constrained Markov decision process (CMDP; Altman, 1999). The objective is to learn a policy that maximizes a reward while constraining the expected return of a scalar cost signal below a certain threshold (Achiam et al., 2017). According to the reward hypothesis, it could be possible to encode safety requirements directly in the reward signal. However, as argued by Ray et al. (2019), safe RL based only on a scalar reward raises the issue of designing a suitable reward function. In particular, balancing the trade-off between reward optimization and safety within a single reward is a difficult problem. Moreover, over-engineering rewards to capture complex safety requirements risks triggering negative side effects that surface only after integration into broader system operation (Abbeel & Ng, 2005; Amodei et al., 2016). Constrained RL addresses this issue via a clear separation of reward and safety.

Reinforcement learning from pixels typically suffers from sample inefficiency, as it requires many interactions with the environment.
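The constrained objective described above is commonly written as follows; the notation here is the standard CMDP formulation rather than anything specific to this work, with $r_t$ and $c_t$ the reward and cost signals, $\gamma$ a discount factor, and $d$ the cost budget:

```latex
\max_{\pi} \; J_R(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]
\quad \text{subject to} \quad
J_C(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t c_t\right] \le d .
```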
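One common way to balance the reward/safety trade-off within a single optimization target is a Lagrangian relaxation of the constraint, in which a multiplier on the cost term is adapted by dual ascent. The following is a minimal sketch of that mechanism only; the function names, learning rate, and toy cost returns are illustrative assumptions, not the method proposed in this paper.

```python
# Illustrative sketch of a Lagrangian relaxation for constrained RL.
# The policy maximizes a reward-critic value penalized by a safety-critic
# value, weighted by a multiplier lam that is adapted so the estimated
# cost return stays below the budget d. All names here are hypothetical.

def lagrange_multiplier_step(lam, cost_return, budget, lr=0.05):
    """Dual ascent on the constraint: grow lam while the estimated
    cost return exceeds the budget, shrink it (toward 0) otherwise."""
    return max(0.0, lam + lr * (cost_return - budget))

def policy_objective(q_reward, q_cost, lam):
    """Scalarized objective for the actor: reward critic value minus
    the safety critic value weighted by the current multiplier."""
    return q_reward - lam * q_cost

# Toy rollout: estimated cost returns start above the budget, so lam
# grows until the penalty discourages unsafe behavior, then relaxes.
lam, budget = 0.0, 25.0
for cost_return in [40.0, 35.0, 28.0, 24.0, 22.0]:
    lam = lagrange_multiplier_step(lam, cost_return, budget)
print(round(lam, 3))  # prints 1.2
```

The `max(0.0, ...)` projection keeps the multiplier non-negative, so the cost penalty can vanish when the constraint is satisfied but never turns into a reward for incurring cost.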
In the case of safe RL, improving sample efficiency is especially crucial, as each interaction with the environment before the agent reaches a safe policy has the potential to cause harm (Zanger et al., 2021). Moreover, to ensure safety, there is an incentive to act pessimistically with regard to the cost (As et al., 2022). This conservative assessment of safety, in turn, may yield a lower reward than is achievable within the safety constraints.

Our contribution. We propose Safe SLAC, an extension of the stochastic latent actor-critic approach (SLAC; Lee et al., 2020) to problems with safety constraints. SLAC learns a stochastic latent variable model of the environment dynamics to address the fact that optimal policies in partially ob-

