FORMAL LANGUAGE CONSTRAINED MARKOV DECISION PROCESSES

Abstract

In order to satisfy safety conditions, an agent may be constrained from acting freely. A safe controller can be designed a priori if an environment is well understood, but not when learning is employed. In particular, reinforcement learned (RL) controllers require exploration, which can be hazardous in safety critical situations. We study the benefits of giving structure to the constraints of a constrained Markov decision process by specifying them in formal languages as a step towards using safety methods from software engineering and controller synthesis. We instantiate these constraints as finite automata to efficiently recognise constraint violations. Constraint states are then used to augment the underlying MDP state and to learn a dense cost function, easing the problem of quickly learning joint MDP/constraint dynamics. We empirically evaluate the effect of these methods on training a variety of RL algorithms over several constraints specified in Safety Gym, MuJoCo, and Atari environments.

1. INTRODUCTION

The ability to impose safety constraints on an agent is key to the deployment of reinforcement learning (RL) systems in real-world environments (Amodei et al., 2016). Controllers that are derived mathematically typically rely on a full a priori analysis of agent behavior remaining within a predefined envelope of safety in order to guarantee safe operation (Aréchiga & Krogh, 2014). This approach restricts controllers to pre-defined, analytical operational limits, but allows for verification of safety properties (Huth & Kwiatkowska, 1997) and satisfaction of software contracts (Helm et al., 1990), which enables their use as a component in larger systems. By contrast, RL controllers are free to learn control trajectories that better suit their tasks and goals; however, understanding and verifying their safety properties is challenging. A particular hazard of learning an RL controller is the requirement of exploration in an unknown environment. It is desirable not only to obey constraints in the final policy, but also throughout the exploration and learning process (Ray et al., 2019). The goal of safe operation as an optimization objective is formalized by the constrained Markov decision process (CMDP) (Altman, 1999), which adds to a Markov decision process (MDP) a cost signal similar to the reward signal, and poses a constrained optimization problem in which discounted reward is maximized while the total cost must remain below a pre-specified limit per constraint. We use this framework and propose specifying CMDP constraints in formal languages to add useful structure based on expert knowledge, e.g., building sensitivity to proximity into constraints on object collision or converting a non-Markovian constraint into a Markovian one (De Giacomo et al., 2020).
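The CMDP optimization problem described above can be stated concretely; the following is the standard formulation (the limits d_i are per-constraint cost budgets in our notation, and whether the cost sum is discounted varies across the literature):

```latex
\max_{\pi} \ \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le d_i,
\qquad i = 1, \dots, k.
```

Here r is the reward signal, each c_i is a cost signal attached to one constraint, and the policy must keep every expected cumulative cost below its budget while maximizing return.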
A significant advantage of specifying constraints with formal languages is that they already form a well-developed basis for components of safety-critical systems (Huth & Kwiatkowska, 1997; Clarke et al., 2001; Kwiatkowska et al., 2002; Baier et al., 2003), and safety properties specified in formal languages can be verified a priori (Kupferman et al., 2000; Bouajjani et al., 1997). Moreover, the recognition problem for many classes of formal languages imposes modest computational requirements, making them suitable for efficient runtime verification (Chen & Roşu, 2007). This allows for low-overhead incorporation of potentially complex constraints into RL training and deployment. We propose (1) a method for posing formal language constraints defined over MDP trajectories as CMDP cost functions; (2) augmenting MDP state with constraint automaton state to more explicitly encourage learning of joint MDP/constraint dynamics; (3) a method for learning a dense cost function given a sparse cost function from joint MDP/constraint dynamics; and (4) a method based on constraint structure to dynamically modify the set of available actions to guarantee the prevention of constraint violations. We validate our methods over a variety of RL algorithms with standard constraints in Safety Gym and hand-built constraints in MuJoCo and Atari environments. The remainder of this work is organized as follows. Section 2 presents related work on CMDPs, on the use of expert advice in RL and safety, and on formal languages in similar settings. Section 3 describes our definition of a formal language-based cost function, as well as how it is employed in state augmentation, cost shaping, and action shaping. Section 4 details our experimental setup and results, and Section 5 discusses limitations and future work.
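As a sketch of ideas (1) and (2), a constraint recognizer (a DFA) can be run alongside the MDP, emitting a unit cost whenever the automaton enters a violating state, and exposing the automaton state as part of the observation. All names below (ConstraintDFA, the toy "no fire twice in a row" constraint, the old-style four-value env.step) are illustrative assumptions, not the paper's implementation.

```python
# Sketch: a formal-language constraint as a CMDP cost function.
# The DFA and all names here are illustrative, not from the paper.

class ConstraintDFA:
    def __init__(self, transitions, start, violating):
        self.transitions = transitions   # (state, symbol) -> state
        self.start = start
        self.violating = violating       # set of violating states
        self.state = start

    def step(self, symbol):
        """Advance the recognizer; return a cost of 1.0 on a violation."""
        # Unrepresented transitions fall back to the start state, matching
        # the convention noted for the automaton in Figure 1b.
        self.state = self.transitions.get((self.state, symbol), self.start)
        if self.state in self.violating:
            self.state = self.start      # reset after flagging a violation
            return 1.0
        return 0.0

# Toy soft constraint: "never take the fire action twice in a row".
dfa = ConstraintDFA(
    transitions={("q0", "fire"): "q1", ("q1", "fire"): "qv"},
    start="q0",
    violating={"qv"},
)

def constrained_step(env, action, dfa):
    """One step of the product system: MDP transition plus constraint cost."""
    obs, reward, done, info = env.step(action)
    cost = dfa.step(action)
    augmented_obs = (obs, dfa.state)     # idea (2): state augmentation
    return augmented_obs, reward, cost, done, info
```

Because the recognizer is just a transition-table lookup per step, the runtime overhead of tracking even an elaborate constraint is negligible compared to the environment step itself.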
We generalise this work with a learned shaping function in the case of dense soft constraints, and by generalising from reward shaping to other CMDP learning mechanisms. Similar to teacher advice is shielding (Jansen et al., 2018; Alshiekh et al., 2018), in which an agent's actions are filtered through a shield that blocks any action that would lead to an unsafe state (similar to hard constraints; Section 3); shielding, however, typically requires MDP states to be enumerable and few enough that a shield can be constructed efficiently.
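The shielding idea sketched above can be expressed with a constraint automaton: before each step, discard any action whose automaton transition would enter a violating state. This is a minimal sketch under our own naming (safe_actions and the toy transition table are hypothetical), not the shield construction of Jansen et al. or Alshiekh et al.

```python
# Sketch of shielding via a constraint automaton (analogous to hard
# constraints): filter out actions that would trigger a violation now.

def safe_actions(dfa_state, actions, transitions, violating, start="q0"):
    """Return the subset of actions that cannot cause a violation."""
    return [a for a in actions
            if transitions.get((dfa_state, a), start) not in violating]

# Toy constraint: "never take the fire action twice in a row".
transitions = {("q0", "fire"): "q1", ("q1", "fire"): "qv"}

# At state q1 (the agent just fired), firing again would reach the
# violating state qv, so it is removed; at q0 all actions are allowed.
```

Unlike a shield built by enumerating MDP states, this filter only needs the automaton's (typically tiny) state space, since the automaton compactly summarizes the relevant trajectory history.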

2. RELATED WORK

Formal Languages  Formal languages and automata have been used before in RL for task specification or as task abstractions (options) in hierarchical reinforcement learning (Icarte et al., 2018b; Li et al., 2017; Wen et al., 2017; Mousavi et al., 2014). In some cases these automata were derived from Linear Temporal Logic (LTL) formulae; in others, LTL or other formal language formulae have been used directly to specify tasks (Icarte et al., 2018a). Littman et al. (2017) define a modified LTL designed for use in reinforcement learning. In robotics, LTL is used for task learning (Li et al., 2017), sometimes in conjunction with teacher demonstrations (Li et al., 2018). Zhu et al. (2019) and Fulton & Platzer (2019) both study the use of formal languages for safe RL, though each makes assumptions about prior knowledge of the environment dynamics. Hasanbeig et al. (2018) and Hasanbeig et al. (2020) learn a product MDP with a safety constraint specified in a formal language.

Safety and CMDP Framework  The CMDP framework does not prescribe the exact form of constraints or how to satisfy the constrained optimization problem. Chow et al. (2017) propose conditional value-at-risk of accumulated cost and chance constraints as the values to be constrained, and use a Lagrangian formulation to derive a Bellman optimality condition. Dalal et al. (2018) use a different constraint for each MDP state and a safety layer that analytically solves a linearized action-correction formulation per state. Similarly, Pham et al. (2018) introduce a layer that corrects the output of a policy to respect constraints on the dynamics of a robotic arm.

Teacher Advice  A subset of work in safe exploration uses expert advice with potential-based reward shaping mechanisms (Ng et al., 1999). Wiewiora et al. (2003) introduce a general method for incorporating arbitrary advice into the reward structure. Saunders et al. (2017) use a human in the loop to learn an effective RL agent while minimizing cost accumulated over training. Camacho et al. (2017a;b) use DFAs with static reward shaping attached to states to express non-Markovian rewards.

Figure 1: (a) Illustration of the formal language constraint framework operating through time. State is carried forward through time by both the MDP and the recognizer, D_C. (b) No-1D-dithering constraint employed in the Atari and MuJoCo domains: .*((lr)^2 | (rl)^2) (note: all unrepresented transitions return to q_0).
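The no-1D-dithering constraint of Figure 1b can be written out as a transition table: a violation is any action suffix matching (lr)^2 or (rl)^2, i.e., four strictly alternating left/right moves. The recognizer below is our reconstruction, not verbatim from the paper; the state names follow the figure (q0 start, qv violating, with q1-q3 and q4-q6 tracking reversal progress after a left or right move), but the exact transition layout is an assumption on our part.

```python
# Reconstruction (assumed layout) of a no-1D-dithering recognizer for
# .*((lr)^2 | (rl)^2) over actions l (left), r (right), n (noop), f (fire).
# q1-q3: last move was l, with 0-2 direction reversals so far; q4-q6
# mirror this for r. All unrepresented transitions return to q0.

TRANSITIONS = {
    ("q0", "l"): "q1", ("q0", "r"): "q4",
    ("q1", "l"): "q1", ("q1", "r"): "q5",
    ("q2", "l"): "q1", ("q2", "r"): "q6",
    ("q3", "l"): "q1", ("q3", "r"): "qv",
    ("q4", "r"): "q4", ("q4", "l"): "q2",
    ("q5", "r"): "q4", ("q5", "l"): "q3",
    ("q6", "r"): "q4", ("q6", "l"): "qv",
}

def violates(actions):
    """True if the action sequence ever completes a dithering violation."""
    state = "q0"
    for a in actions:
        state = TRANSITIONS.get((state, a), "q0")  # n, f, etc. reset to q0
        if state == "qv":
            return True
    return False
```

For example, l r l r and r l r l both violate, while inserting a noop anywhere in the alternation (l r l n r) resets the recognizer and avoids the violation.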

