RECONNAISSANCE FOR REINFORCEMENT LEARNING WITH SAFETY CONSTRAINTS

Anonymous

Abstract

Practical reinforcement learning problems are often formulated as constrained Markov decision process (CMDP) problems, in which the agent must maximize the expected return while satisfying a set of prescribed safety constraints. In this study, we consider a setting in which the agent has access to a generative model that provides a next-state sample for any given state-action pair, and we propose a method that solves a CMDP problem by decomposing it into a pair of MDPs: a reconnaissance MDP (R-MDP) and a planning MDP (P-MDP). In the R-MDP, we train the threat function, the Q-function analogue of the danger measure, which determines whether a given state-action pair is safe. In the P-MDP, we train a reward-seeking policy while using the fixed threat function to determine the safety of each action. With the help of the generative model, we can efficiently train the threat function by preferentially sampling rare dangerous events. Once the threat function for a baseline policy is computed, we can solve other CMDP problems with different rewards and different danger constraints without re-training the model. We also present an efficient approximation of the threat function that greatly reduces the difficulty of solving the R-MDP. We demonstrate the efficacy of our method over classical approaches on benchmark environments and on complex collision-free navigation tasks.

1. INTRODUCTION

With recent advances in reinforcement learning (RL), it is becoming possible to learn complex reward-maximizing policies in increasingly complex environments (Mnih et al., 2015; Silver et al., 2016; Andrychowicz et al., 2018; James et al., 2018; Kalashnikov et al., 2018). However, it is difficult in general to assess whether the policies found by a given RL algorithm are physically safe when applied to real-world situations. This has long been one of the greatest challenges in the application of reinforcement learning to mission-critical systems. In a popular setup, one assumes a Markovian system together with a predefined measure of danger, and formulates the problem as a type of constrained Markov decision process (CMDP) problem. That is, writing π for the policy of the agent, we aim to solve

max_π E_π[R(h)]  s.t.  E_π[D(h)] ≤ c,   (1)

where h is a trajectory of state-action pairs, R(h) is the total return obtained along h, and D(h) is a measure of how dangerous the trajectory h is. To solve this problem, one must monitor the value of E_π[D(h)] throughout training. Methods such as (Altman, 1999; Geibel & Wysotzki, 2005; Geibel, 2006; Achiam et al., 2017a; Chow et al., 2018; 2019) use sampling to approximate E_π[D(h)] or its Lyapunov function at every update. However, the sample-based evaluation of E_π[D(h)] is particularly difficult when the system involves "rare" catastrophic accidents, because an immense number of samples is required to collect information about the causes of such accidents. This problem can be partially resolved if we can use a generative model to predict the outcome of any given sequence of actions and initial state.
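To make the difficulty of sample-based constraint evaluation concrete, the following minimal Python sketch estimates E_π[D(h)] by plain Monte Carlo rollouts. The environment, policy, and per-step danger signal here are hypothetical stand-ins for illustration only; they are not part of the method proposed in this paper.

```python
def rollout_danger(policy, env_step, init_state, horizon):
    """Roll out one trajectory h and accumulate its total danger D(h)."""
    s, total_danger = init_state, 0.0
    for _ in range(horizon):
        a = policy(s)
        s, danger = env_step(s, a)  # next state and per-step danger signal
        total_danger += danger
    return total_danger

def estimate_constraint(policy, env_step, init_state, horizon, n_samples):
    """Monte Carlo estimate of E_pi[D(h)] from n_samples rollouts."""
    total = sum(rollout_danger(policy, env_step, init_state, horizon)
                for _ in range(n_samples))
    return total / n_samples

# Toy deterministic chain: the agent always moves right, and a unit of
# danger is incurred whenever the state exceeds 3.
step = lambda s, a: (s + a, 1.0 if s + a > 3 else 0.0)
print(estimate_constraint(lambda s: 1, step, 0, 5, 100))  # 2.0
```

When the dangerous event is rare rather than deterministic as above, the variance of this estimator explodes, which is exactly why an immense number of samples is needed.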
Model Predictive Control (MPC) (Maciejowski, 2002; Falcone et al., 2007; Wang & Boyd, 2010; Di Cairano et al., 2013; Weiskircher et al., 2017) follows the philosophy of receding horizon and predicts the future outcome of actions in order to determine what action the agent should take at the next step. If the future horizon under consideration is sufficiently short and the dynamics are deterministic, the prediction can often be approximated well by linear dynamics, which can be evaluated instantly. However, because MPC must finish its assessment of the future before taking every action, its performance is limited by the speed of the predictions. When MPC is applied to environments with multiple agents and stochastic dynamics, the computational load of prediction is especially heavy, and it can be difficult to finish the prediction in time. MPC requires this computation at every time step, even when the current state is similar to ones experienced in the past. Meanwhile, if the prediction is done only over a short horizon, MPC may suggest a move to a state leading to a catastrophe. In an effort to reduce the difficulty of evaluating the safety of policies, we propose a novel generative-model-based approach that looks for a solution of a CMDP problem by decomposing the CMDP into a pair of MDPs: a reconnaissance MDP (R-MDP) and a planning MDP (P-MDP). The purpose of the R-MDP is to (1) reconnoiter the state space using the generative model and (2) train, for a baseline policy, the threat function, which is a Q-function analogue of D. In the R-MDP, we use the generative model to selectively sample trajectories containing rare dangerous events, and learn the threat function for the baseline policy in a supervised manner. Once we obtain a good approximation of the threat function for the baseline policy, we can determine whether a given action is safe at each state simply by evaluating the threat function. This process does not involve prediction, which can be computationally demanding.
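The action-screening step described above can be sketched as follows: once a threat function is available, safety checks reduce to cheap function evaluations instead of forward prediction. This is a minimal illustrative sketch; the function names (`threat`, `q_value`) and the scalar threshold are assumptions of this example, not the paper's exact interface.

```python
def safe_actions(threat, state, actions, threshold):
    """Keep only the actions whose estimated threat stays below the
    constraint threshold; no forward rollout is required."""
    return [a for a in actions if threat(state, a) <= threshold]

def safe_greedy_action(q_value, threat, state, actions, threshold, baseline):
    """Choose the reward-maximizing action among the safe ones, falling
    back to the baseline policy when no action passes the screen."""
    allowed = safe_actions(threat, state, actions, threshold)
    if not allowed:
        return baseline(state)
    return max(allowed, key=lambda a: q_value(state, a))
```

In contrast to MPC, the per-step cost here is one threat-function evaluation per candidate action, independent of how far into the future the danger originates.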
We will theoretically show that we can enlarge the set of safe actions by improving the safety of the baseline policy. In the P-MDP, we train the reward-seeking policy while using the threat function to ensure that unsafe actions are never chosen. We may regard the P-MDP as a version of the original MDP in which the agent is only allowed to select an action from the set of safe actions defined by the threat function. The P-MDP can be solved with standard RL methods such as DQN (Mnih et al., 2015). With our framework, the user is freed from the need to monitor E_π[D] throughout the whole training process. We will also show that our approach enjoys the following useful properties: (1) if we can find a safe baseline policy from the R-MDP problem, the learning of the P-MDP will always be safe; (2) so long as the danger is defined with the same D function, we can re-use the threat function constructed for one CMDP problem to solve another CMDP problem with a different reward function and a different constraint threshold; (3) when dealing with a problem with multiple sources of danger, we can use a basic rule of probability to upper-bound the threat function by a sum of sub-threat functions, each summand corresponding to a different source of danger. Property (2) allows us to train an agent that can safely navigate a circuit irrespective of the course layout (Figure 1(d)). In this experiment, we represented the circuit's wall as a set of point obstacles and computed the threat functions for the collision with each obstacle point. Property (3) allows us to find a good reward-seeking policy for a sophisticated task such as safely navigating through a crowd of randomly moving obstacles. Although our method is not guaranteed to find the optimal solution of the CMDP problem, no study to date has succeeded in solving a CMDP in dynamical environments as high-dimensional as the ones discussed in this study.
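Property (3) above can be illustrated with a short sketch: if each source of danger i has its own sub-threat function T_i estimating a per-obstacle accident probability, a union bound over the accident events yields a conservative combined threat. The aggregation below is an illustrative assumption about the form of the sub-threats, not the paper's exact construction.

```python
def combined_threat(sub_threats):
    """Upper-bound the overall threat by the sum of per-obstacle
    sub-threat functions (union bound over the accident events)."""
    def threat(state, action):
        return sum(t(state, action) for t in sub_threats)
    return threat

# Two hypothetical obstacles with constant per-step collision estimates.
t = combined_threat([lambda s, a: 0.1, lambda s, a: 0.2])
```

Because the sum never underestimates the probability of the union of accident events, screening actions against the combined threat remains safe, at the cost of some conservatism.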

2. PROBLEM FORMULATION AND THEORETICAL RESULTS

We assume that the system in consideration is a discrete-time constrained Markov decision process (CMDP) with finite horizon, defined by a tuple (S, A, r, d, P, P_0), where S is the set of states, A is the set of actions, P(s′|s, a) is the density of the state transition probability from s to s′ when the



Figure 1: The trajectories produced by the policy trained by our proposed method ((a) and (d)), by 4-step MPC ((b) and (e)), and by the policy trained with penalized DQN ((c) and (f)). The trajectories on the circular circuit were produced by the policies trained on the original circuit. S represents the initial position of the agent. The red marks represent the places at which the agent crashed into the wall.

