SAFE REINFORCEMENT LEARNING FROM PIXELS USING A STOCHASTIC LATENT REPRESENTATION

Abstract

We address the problem of safe reinforcement learning from pixel observations. Inherent challenges in such settings are (1) a trade-off between reward optimization and adhering to safety constraints, (2) partial observability, and (3) high-dimensional observations. We formalize the problem in a constrained, partially observable Markov decision process framework, where an agent obtains distinct reward and safety signals. To address the curse of dimensionality, we employ a novel safety critic using the stochastic latent actor-critic (SLAC) approach. The latent variable model predicts rewards and safety violations, and we use the safety critic to train safe policies. Using well-known benchmark environments, we demonstrate performance competitive with existing approaches in terms of computational requirements, final reward return, and satisfaction of the safety constraints.

1. INTRODUCTION

As reinforcement learning (RL) algorithms are increasingly applied in the real world (Mnih et al., 2015; Jumper et al., 2021; Fu et al., 2021), their safety becomes ever more important as both model complexity and uncertainty increase. Considerable effort has been devoted to increasing the safety of RL (Liu et al., 2021). However, major challenges remain that prevent the deployment of RL in the real world (Dulac-Arnold et al., 2021). Most approaches to safe RL are limited to fully observable settings, neglecting issues such as noisy or imprecise sensors. Moreover, realistic environments exhibit high-dimensional observation spaces and are largely out of reach for the state of the art. In this work, we present an effective safe RL approach that handles partial observability with high-dimensional observation spaces in the form of pixel observations.

In line with prior work, we formalize the safety requirements using a constrained Markov decision process (CMDP; Altman, 1999). The objective is to learn a policy that maximizes a reward while constraining the expected return of a scalar cost signal to remain below a certain value (Achiam et al., 2017). According to the reward hypothesis, it may be possible to encode safety requirements directly in the reward signal. However, as argued by Ray et al. (2019), safe RL based only on a scalar reward raises the issue of designing a suitable reward function. In particular, balancing the trade-off between reward optimization and safety within a single reward is a difficult problem. Moreover, over-engineering rewards for complex safety requirements risks triggering negative side effects that surface only after integration into broader system operation (Abbeel & Ng, 2005; Amodei et al., 2016). Constrained RL addresses this issue via a clear separation of reward and safety.

Reinforcement learning from pixels typically suffers from sample inefficiency, as it requires many interactions with the environment.
In the case of safe RL, improving sample efficiency is especially crucial, as each interaction with the environment before the agent reaches a safe policy has an opportunity to cause harm (Zanger et al., 2021). Moreover, to ensure safety, there is an incentive to act pessimistically with regard to the cost (As et al., 2022). This conservative assessment of safety, in turn, may yield a lower reward performance than is possible within the safety constraints.

Our contribution. We propose Safe SLAC, an extension of the stochastic latent actor-critic approach (SLAC; Lee et al., 2020) to problems with safety constraints. SLAC learns a stochastic latent variable model of the environment dynamics to address the fact that optimal policies in partially observable settings must estimate the underlying state of the environment from the observations. The model predicts the next observation, the next latent state, and the reward based on the current observation and current latent state. The latent state inferred by the model then provides the input for an actor-critic approach (Konda & Tsitsiklis, 1999). This algorithm learns a critic function that estimates the utility of taking a certain action in the environment, which serves as a supervision signal for the policy, also called the actor. SLAC achieves excellent sample efficiency in the safety-agnostic partially observable setting, which renders it a promising candidate for adaptation to high-dimensional settings with safety constraints. At its core, SLAC is an actor-critic approach, carrying the potential for a natural extension to safety via a safety critic.
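To illustrate what such a safety critic could estimate, the following is a minimal sketch of a temporal-difference target for a discounted cost return, mirroring the Bellman backup of an ordinary reward critic. The function name and the separate cost discount `gamma_c` are our own illustrative choices, not part of SLAC.

```python
def safety_critic_td_target(cost, next_q_cost, gamma_c=0.99, done=False):
    """TD target for a safety critic estimating the discounted cost return:
    C_t = c_t + gamma_c * Q_cost(s_{t+1}, a_{t+1}), zeroed at episode end."""
    return cost + (0.0 if done else gamma_c * next_q_cost)
```

A safety critic trained by regressing onto such targets gives the policy a differentiable estimate of future constraint violations, analogous to how the reward critic supervises reward maximization.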
We extend SLAC in three ways to create our safe RL approach under partial observability (Safe SLAC): (1) the latent variable model also predicts cost violations; (2) we learn a safety critic that predicts the discounted cost return; and (3) we modify the policy training procedure to optimize a safety-constrained objective via a Lagrangian relaxation, solved using dual gradient descent on the primary objective and a Lagrange multiplier to overcome the inherent difficulty of constrained optimization. We evaluate Safe SLAC on a set of benchmark environments introduced by Ray et al. (2019). The empirical evaluation shows competitive results compared with more complex state-of-the-art approaches.
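The Lagrangian relaxation in (3) can be sketched as alternating gradient steps on the policy objective and on the dual variable. The code below is a schematic illustration under our own naming, not the exact Safe SLAC update: the multiplier is optimized in log space so it stays positive, and it grows whenever the estimated cost return exceeds the budget.

```python
import math

def actor_loss(q_reward, q_cost, lam):
    # Lagrangian-relaxed objective: maximize the reward critic's estimate
    # while penalizing the safety critic's estimate, weighted by lambda.
    return -(q_reward - lam * q_cost)

def dual_step(log_lam, est_cost_return, cost_budget, lr=1e-2):
    # Gradient ascent on the dual variable: lambda increases when the
    # estimated cost return violates the budget, and decreases otherwise.
    return log_lam + lr * (est_cost_return - cost_budget)

log_lam = 0.0
log_lam = dual_step(log_lam, est_cost_return=30.0, cost_budget=25.0)
lam = math.exp(log_lam)  # lambda rises: the constraint is being violated
```

In practice, both steps would run on minibatches sampled from a replay buffer, with the actor loss differentiated through the critics.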

2. RELATED WORK

Established baseline algorithms for safe reinforcement learning in the fully observable setting include constrained policy optimization (CPO; Achiam et al., 2017), as well as TRPO-Lagrangian (Ray et al., 2019), a cost-constrained variant of trust region policy optimization (TRPO; Schulman et al., 2015). While TRPO-Lagrangian uses an adaptive Lagrange multiplier to solve the constrained problem with primal-dual optimization, CPO solves the problem of constraint satisfaction analytically during the policy update.

The method most closely related to ours is the Lagrangian model-based agent (LAMBDA; As et al., 2022), which also addresses the problem of learning a safe policy from pixel observations under high partial observability. LAMBDA uses the partially stochastic dynamics model introduced by Hafner et al. (2019). The authors take a Bayesian approach to the dynamics model, sampling from the posterior over parameters to obtain different instantiations of the model. For each instantiation, simulated trajectories are sampled; the worst cost return and the best reward return are then used to train critic functions that provide a gradient to the policy. LAMBDA shows competitive performance with baseline algorithms; however, there are two major trade-offs. First, by taking a pessimistic approach, the learned policy attains a lower cost return than the allowed cost budget. A less pessimistic approach that uses the entirety of the allowed cost budget may yield a constraint-satisfying policy with a higher reward return. Second, the LAMBDA training procedure involves generating many samples from the latent variable model to estimate the optimistic/pessimistic temporal difference updates.

While the reinforcement learning literature offers numerous safety perspectives (García & Fernández, 2015; Pecka & Svoboda, 2014), we focus on constraining the behavior of the agent in expectation.
A class of methods known as shielding ensures safety even during training, using temporal logic specifications of safety (Alshiekh et al., 2018; Jansen et al., 2020). Such methods, however, require extensive prior knowledge in the form of a (partial) model of the environment (Carr et al., 2023).

3. CONSTRAINED PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES

In reinforcement learning, an agent learns to sequentially interact with an environment to maximize some signal of utility. This problem setting is typically modeled as a Markov decision process (MDP; Sutton & Barto, 2018), in which the environment is composed of a set of states S and a set of actions A. At each timestep t, the agent receives the current environment state s_t ∈ S and executes an action a_t ∈ A according to the policy π: a_t ∼ π(a_t | s_t). This action results in a new state according to the transition dynamics s_{t+1} ∼ p(s_{t+1} | s_t, a_t) and a scalar reward signal r_t = r(s_t, a_t) ∈ ℝ, where r is the reward function. The goal is for the agent to learn an optimal policy π⋆ such that the expectation of the discounted, accumulated reward in the environment under that policy is maximized, i.e., π⋆ = arg max_π E[Σ_t γ^t r_t] with γ ∈ [0, 1). We use ρ_π to denote the distribution over trajectories induced in the environment by a policy π.
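The discounted return Σ_t γ^t r_t being maximized above can be computed for a sampled trajectory with a simple backward recursion G_t = r_t + γ G_{t+1}; the following sketch (function name is ours) makes the objective concrete.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory by iterating
    backwards, using the recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

In a CMDP, the same computation applied to the cost signal c_t yields the cost return whose expectation is constrained below the budget.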

