GUIDED SAFE SHOOTING: MODEL-BASED REINFORCEMENT LEARNING WITH SAFETY CONSTRAINTS

Abstract

In the last decade, reinforcement learning has successfully solved complex control tasks and decision-making problems, like the Go board game. Yet, there are few success stories when it comes to deploying those algorithms in real-world scenarios. One of the reasons is the lack of guarantees when dealing with and avoiding unsafe states, a fundamental requirement in critical control engineering systems. In this paper, we introduce Guided Safe Shooting (GuSS), a model-based RL approach that can learn to control systems with minimal violations of the safety constraints. The model is learned on the data collected during the operation of the system in an iterated batch fashion, and is then used to plan for the best action to perform at each time step. We propose three different safe planners, one based on a simple random shooting strategy and two based on MAP-Elites, a more advanced divergent-search algorithm. Experiments show that these planners help the learning agent avoid unsafe situations while maximally exploring the state space, a necessary aspect when learning an accurate model of the system. Furthermore, compared to model-free approaches, learning a model allows GuSS to reduce the number of interactions with the real system while still reaching high rewards, a fundamental requirement when handling engineering systems.

1. INTRODUCTION

Figure 1: An illustrative example of planning with a model-based approach on the Acrobot environment. The agent controls the torque on the first joint with the goal of raising its end effector as high as possible while avoiding the unsafe zone (red area). Starting from the rest position (left), the agent uses its model to find the best plan (middle) that maximizes the reward while satisfying the safety constraint, and executes it on the real system (right). The example is especially relevant to applications in which safety and reward are traded off.

In recent years, deep Reinforcement Learning (RL) has solved complex sequential decision-making problems in a variety of domains, such as robot control and video and board games (Mnih et al., 2015; Andrychowicz et al., 2020; Silver et al., 2016). However, in the majority of these cases, success is limited to a simulated world; the application of these RL solutions to real-world systems has yet to come. The main reason for this gap is the fundamental RL principle of learning by trial and error to maximize a reward signal (Sutton & Barto, 2018). This framework requires unlimited access to the system to explore and perform actions possibly leading to undesired outcomes, which is not always possible. For example, in the task of finding the optimal control strategy for a data-center cooling problem (Lazic et al., 2018), the RL algorithm could easily take actions leading to high temperatures during the learning process, affecting and potentially breaking the system. Another domain where safety is crucial is robotics: here, unsafe actions could not only break the robot but could also harm humans. This issue, known as safe exploration, is a central problem in AI safety (Amodei et al., 2016), and it is why most achievements in RL are in simulated environments, where agents can explore different behaviors without the risk of damaging the real system.
However, those simulators are not always accurate enough, if available at all, leading to suboptimal control strategies when deployed on the real system (Salvato et al., 2021). With the long-term goal of deploying RL algorithms on real engineering systems, it is imperative to overcome those limitations. A straightforward way to address this issue is to develop algorithms that can be deployed directly on the real system and that provide guarantees in terms of constraints, such as safety, to ensure the integrity of the system. This could have a great impact, as many industrial systems require complex decision-making, which efficient RL systems can provide. Towards this goal, in this paper we introduce Guided Safe Shooting (GuSS), a safe Model-Based Reinforcement Learning (MBRL) algorithm that learns a model of the system and uses it to plan for a safe course of actions through Model Predictive Control (MPC) (Garcia et al., 1989). GuSS learns the model in an iterated batch fashion (Matsushima et al., 2021; Kégl et al., 2021), allowing for minimal real-system interactions. This is a desirable property for safe RL approaches, as fewer interactions with the real system mean less chance of entering unsafe states, a condition difficult to attain with model-free safe RL methods (Achiam et al., 2017; Ray et al., 2019b; Tessler et al., 2018). Moreover, learning a model of the system provides flexibility and safety guarantees, since the model can be used to anticipate unsafe actions before they occur. Consider the illustrative example in Fig. 1: thanks to the model of its dynamics, the agent can perform a "mental simulation" and select the best plan to attain its goal while avoiding unsafe zones. This contrasts with many of the methods in the literature that address the problem of finding a safe course of action through Lagrangian optimization or by penalizing the reward function (Webster & Flach, 2021; Ma et al., 2021; Cowen-Rivers et al., 2022).
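The MPC loop with a safety filter can be sketched as follows. This is a minimal illustration, not the authors' implementation: `model.predict`, `reward_fn`, and `is_unsafe` are hypothetical names standing in for the learned dynamics model, the reward, and the safety constraint.

```python
import numpy as np

def plan_action(model, state, reward_fn, is_unsafe,
                horizon=20, n_candidates=100, action_dim=1, rng=None):
    """Random-shooting MPC sketch: roll candidate action sequences through
    the learned model, discard any sequence predicted to enter an unsafe
    state, and return the first action of the best remaining sequence."""
    rng = np.random.default_rng() if rng is None else rng
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total, safe = state, 0.0, True
        for a in actions:
            s = model.predict(s, a)      # "mental simulation" on the model
            if is_unsafe(s):             # safety filter: drop this candidate
                safe = False
                break
            total += reward_fn(s, a)
        if safe and total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action             # MPC executes only the first action
```

In the MPC fashion described above, only the first action of the selected plan is executed on the real system; the model is then queried again from the newly observed state.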
GuSS avoids unsafe situations by discarding trajectories that the model predictions deem unsafe. Within this framework, we propose three different safe planners: one based on a simple random-shooting strategy and two based on MAP-Elites (ME) (Mouret & Clune, 2015), a more advanced Quality-Diversity (QD) algorithm. These planners are used to generate, evaluate, and select the safest actions with the highest rewards. Using divergent-search methods for planning allows the agent to explore the possible courses of action more widely. This leads to a safer and more efficient search that covers a larger portion of the state space, an important factor when learning a model, since more exploratory data lead to better models (Yarats et al., 2022). We test GuSS on three different environments. The presented results highlight how the model and planners can easily find strategies reaching high rewards with minimal costs, even when the two metrics are antithetical, as is the case for the Safe Acrobot environment. To recap, the contributions of this paper are the following:

• We introduce Guided Safe Shooting (GuSS), an MBRL method capable of efficiently learning to avoid unsafe states while optimizing the reward;

• We propose the use of quality-diversity evolutionary methods such as MAP-Elites (ME) as planning techniques in MBRL approaches;

• We present three different planners, Safe Random Shooting (S-RS), Safe MAP-Elites (S-ME), and Pareto Safe MAP-Elites (PS-ME), that can generate a wide array of action sequences while discarding those deemed unsafe during planning.
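As a rough illustration of how a MAP-Elites-style safe planner might operate, the sketch below keeps an archive of action sequences binned by a behavior descriptor and, within each bin, retains only the highest-reward sequence that is predicted safe. All names (`evaluate`, `sample`, `mutate`) and the one-dimensional descriptor are illustrative assumptions, not the S-ME or PS-ME algorithms of the paper.

```python
import numpy as np

def safe_map_elites(evaluate, sample, mutate,
                    descriptor_bins=10, init_size=50, iterations=200, rng=None):
    """MAP-Elites-style search over action sequences with a safety filter.

    `evaluate(seq)` is assumed to return (fitness, descriptor in [0, 1],
    is_safe) by rolling the sequence through the learned model."""
    rng = np.random.default_rng() if rng is None else rng
    archive = {}  # descriptor bin -> (fitness, action_sequence)

    def try_add(seq):
        fitness, descriptor, safe = evaluate(seq)
        if not safe:
            return  # unsafe sequences never enter the archive
        b = int(np.clip(descriptor * descriptor_bins, 0, descriptor_bins - 1))
        if b not in archive or fitness > archive[b][0]:
            archive[b] = (fitness, seq)

    for _ in range(init_size):                 # random initialization
        try_add(sample(rng))
    for _ in range(iterations):                # divergent search: mutate elites
        if not archive:
            break
        parent = archive[rng.choice(list(archive))][1]
        try_add(mutate(parent, rng))

    # return the best safe plan found across the whole archive
    return max(archive.values(), key=lambda e: e[0])[1] if archive else None
```

Because elites are kept per descriptor bin rather than globally, the search maintains a diverse set of safe candidate plans instead of collapsing onto a single high-reward behavior, which matches the exploration argument above.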

2. RELATED WORK

Some of the most common techniques addressing safety in RL rely on solving a Constrained Markov Decision Process (CMDP) (Altman, 1999) through model-free RL methods (Achiam et al., 2017; Ray et al., 2019b; Tessler et al., 2018; Hsu et al., 2021; Zhang et al., 2020). Among these approaches, a well-known method is CPO (Achiam et al., 2017), which adds constraints to the policy optimization process in a fashion similar to TRPO (Schulman et al., 2015). A similar approach is taken by PCPO (Yang et al., 2020b) and its extension (Yang et al., 2020a): the algorithm first optimizes the policy with respect to the reward and then projects it back onto the constraint set in an iterated two-step process. A different strategy consists of storing all the "recovery" actions that the agent took to leave unsafe regions in a separate replay buffer (Hsu et al., 2021). This buffer is used whenever

