GUIDED SAFE SHOOTING: MODEL-BASED REINFORCEMENT LEARNING WITH SAFETY CONSTRAINTS

Abstract

In the last decade, reinforcement learning has successfully solved complex control tasks and decision-making problems, like the Go board game. Yet, there are few success stories when it comes to deploying those algorithms in real-world scenarios. One of the reasons is the lack of guarantees when dealing with and avoiding unsafe states, a fundamental requirement in critical control engineering systems. In this paper, we introduce Guided Safe Shooting (GuSS), a model-based RL approach that can learn to control systems with minimal violations of the safety constraints. The model is learned on the data collected during the operation of the system in an iterated batch fashion, and is then used to plan the best action to perform at each time step. We propose three different safe planners, one based on a simple random shooting strategy and two based on MAP-Elites, a more advanced divergent-search algorithm. Experiments show that these planners help the learning agent avoid unsafe situations while maximally exploring the state space, a necessary aspect when learning an accurate model of the system. Furthermore, compared to model-free approaches, learning a model allows GuSS to reduce the number of interactions with the real system while still reaching high rewards, a fundamental requirement when handling engineering systems.
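To make the simplest of these planners concrete, the sketch below shows a minimal safe random shooting planner evaluated against a learned one-step dynamics model. The `toy_model`, `is_unsafe`, and `reward_fn` stand-ins are our own illustrative assumptions, not the paper's model or constraint definitions, and the MAP-Elites planners would replace the uniform sampling step with a divergent search over action sequences.

```python
import numpy as np

def safe_random_shooting(model, state, horizon=20, n_candidates=500, rng=None):
    """Sample random action sequences, roll each one out through the
    learned model, discard sequences that visit an unsafe state, and
    return the first action of the highest-reward safe sequence."""
    rng = rng or np.random.default_rng(0)
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)  # assumed 1-D bounded actions
        s, total_reward, safe = state, 0.0, True
        for a in actions:
            s = model(s, a)               # one-step prediction of the learned model
            if is_unsafe(s):              # hypothetical safety predicate
                safe = False
                break
            total_reward += reward_fn(s)  # hypothetical reward function
        if safe and total_reward > best_return:
            best_return, best_action = total_reward, actions[0]
    return best_action                    # None if no safe sequence was found

# Toy stand-ins so the sketch runs end to end; not the paper's definitions.
def toy_model(s, a):
    return 0.9 * s + 0.1 * a              # stable linear dynamics

def is_unsafe(s):
    return abs(s) > 2.0                   # unsafe zone, akin to the red area in Figure 1

def reward_fn(s):
    return -abs(s - 1.0)                  # reward peaks at the target state s = 1

print(safe_random_shooting(toy_model, state=0.0))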

1. INTRODUCTION

Figure 1: An illustrative example of planning with a model-based approach on the Acrobot environment. The agent controls the torque on the first joint with the goal of getting its end effector as high as possible while avoiding the unsafe zone (red area). Starting in the rest position (left), the agent uses its model to find the best plan (middle) that maximizes the reward while satisfying the safety constraint, and executes it on the real system (right). The example is especially relevant to applications in which safety and reward are traded off.

In recent years, deep Reinforcement Learning (RL) has solved complex sequential decision-making problems in a variety of domains, such as controlling robots and playing video and board games (Mnih et al., 2015; Andrychowicz et al., 2020; Silver et al., 2016). However, in the majority of these cases, success is limited to a simulated world; the application of these RL solutions to real-world systems is yet to come. The main reason for this gap is the fundamental principle of RL of learning by trial and error to maximize a reward signal (Sutton & Barto, 2018). This framework requires unlimited access to the system to explore, performing actions that may lead to undesired outcomes, and such access is not always possible. For example, consider the task of finding the optimal control strategy for a data-center cooling problem (Lazic et al., 2018): the RL algorithm could easily take actions leading to high temperatures during the learning process, affecting and potentially breaking the system.
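To connect the figure to code, the plan-then-execute cycle it depicts can be written as a standard model-predictive control loop. The sketch below is our illustration under assumptions, reusing the hypothetical `safe_random_shooting` planner from above and a gym-style `env.step` interface; the fallback action for when no safe plan is found is likewise an assumption, not the paper's mechanism.

```python
def run_episode(env, model, planner, episode_len=200):
    """MPC-style execution as in Figure 1: plan on the learned model,
    execute only the first action on the real system, then replan
    from the newly observed state."""
    state = env.reset()
    transitions = []                    # data for the next iterated-batch model fit
    for _ in range(episode_len):
        action = planner(model, state)  # e.g. safe_random_shooting from above
        if action is None:              # no safe plan found: fall back to a
            action = 0.0                # known-safe default action (assumption)
        next_state, reward, done, _ = env.step(action)
        transitions.append((state, action, next_state))
        state = next_state
        if done:
            break
    return transitions
```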

