RISK-AWARE BAYESIAN REINFORCEMENT LEARNING FOR CAUTIOUS EXPLORATION

Abstract

This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that safety constraint violations are bounded at any point during learning. Since enforcing safety during training can limit the agent's exploration, we propose a new architecture that handles the trade-off between efficient progress in exploration and safety maintenance. As the agent's exploration progresses, we use Bayesian inference to update Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the agent's behaviour within the environment. We then propose a way to approximate the moments of the agent's belief about the risk associated with its behaviour, originating from local action selection. We demonstrate that this approach can be easily coupled with RL, provide rigorous theoretical guarantees, and present experimental results to showcase the performance of the overall architecture.
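The Dirichlet-Categorical belief update mentioned above admits a simple closed-form (conjugate) implementation. The following is a minimal illustrative sketch, not the paper's implementation; the function names (`update`, `posterior_mean`, `posterior_variance`), the uniform prior, and the toy state/action sizes are our own assumptions.

```python
import numpy as np

# One Dirichlet prior per (state, action) pair over successor states.
n_states, n_actions = 4, 2
alpha = np.ones((n_states, n_actions, n_states))  # uniform prior (all counts 1)

def update(alpha, s, a, s_next):
    """Conjugate posterior update after observing transition (s, a) -> s_next."""
    alpha = alpha.copy()
    alpha[s, a, s_next] += 1.0
    return alpha

def posterior_mean(alpha, s, a):
    """Posterior mean estimate of the transition distribution P(. | s, a)."""
    return alpha[s, a] / alpha[s, a].sum()

def posterior_variance(alpha, s, a):
    """Elementwise posterior variance of the transition probabilities."""
    a_sa = alpha[s, a]
    a0 = a_sa.sum()
    m = a_sa / a0
    return m * (1.0 - m) / (a0 + 1.0)

# Observe the same transition repeatedly; the belief concentrates on it.
for _ in range(10):
    alpha = update(alpha, s=0, a=1, s_next=2)

print(posterior_mean(alpha, 0, 1))      # mass concentrates on successor state 2
print(posterior_variance(alpha, 0, 1))  # variances shrink as counts accumulate
```

Such posterior moments are the kind of quantity one can propagate to obtain a belief over the risk of local action choices, as the abstract describes.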

1. INTRODUCTION

Traditionally, RL is principally concerned with the policy that the agent produces by the end of the learning process. In other words, the agent's policy during learning is overlooked in favour of learning how to behave optimally. Accordingly, many standard RL methods rely on the assumption that, during exploration, the agent selects each available action at every state infinitely often (Sutton et al., 2018; Puterman, 2014). A related technical assumption that is often made is that the Markov decision process (MDP) is ergodic, meaning that every state is reachable from every other state under proper action selection (Moldovan & Abbeel, 2012). These assumptions may sometimes be reasonable, e.g., in virtual environments where restarting is always an option. In safety-critical systems, however, they may be unreasonable, as we may explicitly require the agent to never visit certain unsafe states. Indeed, in a variety of RL applications the safety of the agent is particularly important, e.g., for expensive autonomous platforms or robots that work in proximity to humans. Thus, researchers have recently been paying increasing attention not only to maximising a long-term task-driven reward, but also to avoiding unsafe behaviour during training.

Related Work

The general problem of Safe RL has been an active area of research in which numerous approaches and definitions of safety have been proposed (Brunke et al., 2021; Garcia & Fernandez, 2015; Pecka & Svoboda, 2014). In (Moldovan & Abbeel, 2012), safety is defined in terms of ergodicity, the goal being that the agent is always able to return to its current state after moving away from it. In (Chow et al., 2018a), safety is pursued by minimising a cost associated with worst-case scenarios, where cost reflects a lack of safety. Similarly, (Miryoosefi et al., 2019) defines the safety constraint by requiring the expected sum of a vector of measurements to lie in a target set. Other approaches (Li & Belta, 2019; Hasanbeig et al., 2019a;b; 2020; Cai et al., 2021; Hasanbeig et al., 2022) define safety as the satisfaction of temporal logic formulae by the learnt policy, but do not provide safety while training that policy. Many existing approaches have been concerned with providing guarantees on the safety of the learned policy, sometimes under the assumption that a backup policy is available (Coraluppi & Marcus, 1999; Perkins & Barto, 2002; Geibel & Wysotzki, 2005; Mannucci et al., 2017; Chow et al., 2018b; Mao et al., 2019). These methods are suitable for systems that can be trained on accurate simulations, but many other real-world systems instead require safety during training itself. There has also been considerable research on approaches that maintain safety during training. For instance, (Alshiekh et al., 2017; Jansen et al., 2019; Giacobbe et al., 2021) leverage the concept of a shield that stops the agent from choosing any unsafe actions. The shield

