RECONNAISSANCE FOR REINFORCEMENT LEARNING WITH SAFETY CONSTRAINTS

Anonymous

Abstract

Practical reinforcement learning problems are often formulated as constrained Markov decision process (CMDP) problems, in which the agent has to maximize the expected return while satisfying a set of prescribed safety constraints. In this study, we consider a setting in which the agent has access to a generative model that provides a next-state sample for any given state-action pair, and propose a method that solves a CMDP problem by decomposing the CMDP into a pair of MDPs: a reconnaissance MDP (R-MDP) and a planning MDP (P-MDP). In the R-MDP, we train a threat function, the Q-function analogue of danger, which determines whether a given state-action pair is safe. In the P-MDP, we train a reward-seeking policy while using the fixed threat function to assess the safety of each action. With the help of the generative model, we can train the threat function efficiently by preferentially sampling rare dangerous events. Once the threat function for a baseline policy is computed, we can solve other CMDP problems with different rewards and different danger constraints without re-training the model. We also present an efficient approximation method for the threat function that greatly reduces the difficulty of solving the R-MDP. We demonstrate the efficacy of our method over classical approaches on benchmark tasks and complex collision-free navigation tasks.
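To make the P-MDP idea concrete, the following is a minimal sketch (our own illustration, not the paper's implementation) of how a pre-trained threat function could gate a reward-seeking policy: actions whose estimated threat exceeds a threshold are filtered out, and the agent acts greedily among the remaining safe actions. The names `safe_greedy_action`, `q_value`, and `threat` are hypothetical.

```python
# Sketch of threat-gated action selection, assuming a discrete action set.
# `threat(state, action)` plays the role of the paper's threat function,
# a Q-function analogue of danger; `q_value(state, action)` is an ordinary
# reward-seeking Q-function trained in the P-MDP.

def safe_greedy_action(state, actions, q_value, threat, threshold):
    """Return the highest-value action whose threat is below `threshold`,
    or None if no action is deemed safe in this state."""
    safe = [a for a in actions if threat(state, a) <= threshold]
    if not safe:
        return None  # no admissible action; caller must fall back
    return max(safe, key=lambda a: q_value(state, a))
```

Because the threat function is trained once in the R-MDP and then held fixed, the same filter can be reused while the reward-seeking Q-function is re-trained for a different reward.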

1. INTRODUCTION

With recent advances in reinforcement learning (RL), it is becoming possible to learn complex reward-maximizing policies in increasingly complex environments (Mnih et al., 2015; Silver et al., 2016; Andrychowicz et al., 2018; James et al., 2018; Kalashnikov et al., 2018). However, it is difficult in general to assess whether the policies found by a given RL algorithm are physically safe when applied to real-world situations. This has long been one of the greatest challenges in the application of reinforcement learning to mission-critical systems. In a popular setup, one assumes a Markovian system together with a predefined measure of danger, and formulates the problem as a type of constrained Markov decision process (CMDP) problem. That is, using the classical RL notation in which π represents the policy of the agent, we aim to solve

max_π E_π[R(h)]   s.t.   E_π[D(h)] ≤ c,   (1)

where h is a trajectory of state-action pairs, R(h) is the total return obtained along h, and D(h) is a measure of how dangerous the trajectory h is. To solve this problem, one must monitor the value of E_π[D(h)] throughout the training. Methods like (Altman, 1999; Geibel & Wysotzki, 2005; Geibel, 2006; Achiam et al., 2017a; Chow et al., 2018; 2019) use sampling to approximate E_π[D(h)] or its Lyapunov function at every update. However, the sample-based evaluation of E_π[D(h)] is particularly difficult when the system involves "rare" catastrophic accidents, because an immense number of samples is required to collect information about the cause of such accidents. This problem can be partially resolved if we can use a generative model to predict the outcome of any given sequence of actions and initial state.
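The difficulty of the sample-based approach can be seen in a toy Monte Carlo estimate of the constraint term (a hypothetical illustration, not any of the cited methods): when an accident occurs with per-step probability p, on the order of 1/p episodes are needed just to observe a single accident, so the estimate of E_π[D(h)] is dominated by noise for rare events. The function names below are our own.

```python
import random

# Toy Monte Carlo estimate of the expected danger E_pi[D(h)], where
# D(h) = 1 if an accident occurs anywhere along the trajectory, else 0.

def rollout_danger(p_accident, horizon=100, rng=random):
    """Sample one trajectory under a fixed policy; each step triggers an
    accident independently with probability p_accident."""
    return 1.0 if any(rng.random() < p_accident for _ in range(horizon)) else 0.0

def estimate_expected_danger(p_accident, n_episodes, seed=0):
    """Plain sample average over n_episodes rollouts."""
    rng = random.Random(seed)
    return sum(rollout_danger(p_accident, rng=rng)
               for _ in range(n_episodes)) / n_episodes

# With p_accident = 1e-4, a few hundred episodes often contain no accident
# at all, so the constraint E_pi[D(h)] <= c cannot be checked reliably.
```

A generative model sidesteps this by letting us sample next states preferentially near dangerous configurations instead of waiting for on-policy rollouts to stumble into them.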
Model Predictive Control (MPC) (Maciejowski, 2002; Falcone et al., 2007; Wang & Boyd, 2010; Di Cairano et al., 2013; Weiskircher et al., 2017) follows the receding-horizon philosophy and predicts the future outcomes of actions in order to determine what action the agent should take in the next step. If the future horizon under consideration is sufficiently short and the dynamics are deterministic, the prediction can often be approximated well by linear dynamics.
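The receding-horizon idea can be sketched as follows (a generic illustration under an assumed deterministic linear model x' = Ax + Bu, not a specific controller from the cited works): roll each candidate control sequence forward over a short horizon, score the predicted trajectory, apply only the first control of the best sequence, and re-plan at the next step.

```python
import numpy as np

def predict_trajectory(x0, controls, A, B):
    """Roll the assumed linear model x' = A x + B u forward over a
    short horizon, returning the visited states including x0."""
    xs, x = [x0], x0
    for u in controls:
        x = A @ x + B @ u
        xs.append(x)
    return xs

def mpc_step(x0, candidate_sequences, A, B, cost):
    """Receding-horizon step: pick the candidate control sequence whose
    predicted trajectory has the lowest accumulated cost, and return
    only its first control (the rest is re-planned next step)."""
    best = min(candidate_sequences,
               key=lambda seq: sum(cost(x)
                                   for x in predict_trajectory(x0, seq, A, B)))
    return best[0]
```

In practice the inner minimization is solved by quadratic programming rather than enumeration, but the structure, short-horizon prediction followed by applying only the first action, is the same.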

