RISK-AWARE BAYESIAN REINFORCEMENT LEARNING FOR CAUTIOUS EXPLORATION

Abstract

This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that safety constraint violations are bounded at any point during learning. Whilst enforcing safety during training might limit the agent's exploration, we propose a new architecture that handles the trade-off between efficient exploration progress and safety maintenance. As the agent's exploration progresses, we use Bayesian inference to update Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the agent's behaviour within the environment. We then propose a way to approximate moments of the agent's belief about the risk associated with its behaviour, as this risk originates from local action selection. We demonstrate that this approach can be easily coupled with RL, provide rigorous theoretical guarantees, and present experimental results showcasing the performance of the overall architecture.

1. INTRODUCTION

Traditionally, RL is principally concerned with the policy that the agent has generated by the end of the learning process. In other words, the agent's policy during learning is overlooked to the benefit of learning how to behave optimally. Accordingly, many standard RL methods rely on the assumption that the agent selects each available action at every state infinitely often during exploration (Sutton et al., 2018; Puterman, 2014). A related technical assumption that is often made is that the Markov decision process (MDP) is ergodic, meaning that every state is reachable from every other state under proper action selection (Moldovan & Abbeel, 2012). These assumptions may sometimes be reasonable, e.g., in virtual environments where restarting is always an option. However, in safety-critical systems they might be unreasonable, as we may explicitly require the agent to never visit certain unsafe states. Indeed, in a variety of RL applications the safety of the agent is particularly important, e.g., for expensive autonomous platforms or robots working in proximity to humans. Thus, researchers have recently been paying increasing attention not only to maximising a long-term task-driven reward, but also to enforcing the avoidance of unsafe behaviour during training.

Related Work

The general problem of Safe RL has been an active area of research in which numerous approaches and definitions of safety have been proposed (Brunke et al., 2021; Garcia & Fernandez, 2015; Pecka & Svoboda, 2014). In (Moldovan & Abbeel, 2012), safety is defined in terms of ergodicity, with the requirement that the agent is always able to return to its current state after moving away from it. In (Chow et al., 2018a), safety is pursued by minimising a cost associated with worst-case scenarios, where the cost encodes a lack of safety. Similarly, (Miryoosefi et al., 2019) defines the safety constraint by requiring the expected sum of a vector of measurements to lie in a target set. Other approaches (Li & Belta, 2019; Hasanbeig et al., 2019a; b; 2020; Cai et al., 2021; Hasanbeig et al., 2022) define safety as the satisfaction of temporal logic formulae by the learnt policy, but do not provide safety while training that policy. Many existing approaches have been concerned with providing guarantees on the safety of the learned policy, sometimes under the assumption that a backup policy is available (Coraluppi & Marcus, 1999; Perkins & Barto, 2002; Geibel & Wysotzki, 2005; Mannucci et al., 2017; Chow et al., 2018b; Mao et al., 2019). These methods are applicable to systems that can be trained on accurate simulators, but many other real-world systems instead require safety during training. There has also been much research into approaches that maintain safety during training. For instance, (Alshiekh et al., 2017; Jansen et al., 2019; Giacobbe et al., 2021) leverage the concept of a shield that stops the agent from choosing any unsafe actions. The shield construction assumes that the agent can observe the entire MDP (and its opponents) in order to build a safety (game) model, which is unavailable for many partially known MDP tasks.
The approach in (Garcia & Fernandez, 2012) assumes a predefined safe baseline policy, most likely sub-optimal, and attempts to slowly improve it with a slightly noisy action-selection policy, defaulting to the baseline policy whenever a measure of safety is exceeded. However, this measure of safety assumes that nearby states have similar safety levels, which may not always be the case. Another common approach is to use expert demonstrations to learn how to behave safely (Abbeel et al., 2010), or even to include an option to default to an expert when the risk is too high (Torrey & Taylor, 2012). Such approaches rely heavily on the presence and help of an expert, which cannot always be counted upon. Other approaches to this problem (Wen & Topcu, 2018; Cheng et al., 2019; Turchetta et al., 2016) are either computationally expensive or require explicit, strong assumptions about the model of agent-environment interactions. Crucially, maintaining safety in RL by efficiently leveraging available data remains an open problem (Taylor et al., 2021).

Contributions

We tackle the problem of synthesising a policy via RL that optimises a discounted reward while not violating a safety requirement during learning. This paper puts forward a cautious RL scheme in which the agent maintains a Dirichlet-Categorical model of the MDP. We incorporate higher-order information from the Dirichlet distributions; in particular, we compute approximations of the (co)variances of the risk terms. This allows the agent to reason about the contribution of epistemic uncertainty to the risk level, and therefore to make better-informed decisions about how to stay safe during learning. We show convergence results for these approximations, and propose a novel method to derive an approximate bound on the confidence that the risk is below a certain level.
The new method adds a functionality that prevents the agent from taking critically risky actions and instead leads it to take safer actions whenever possible, but otherwise leaves the agent to explore as normal. The proposed method is versatile, in that it can be added to general RL training schemes in order to maintain safety during learning.
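As a concrete illustration of the Dirichlet-Categorical model underlying this scheme, the following Python sketch maintains one Dirichlet posterior per state-action pair and exposes the posterior mean and variance of each transition probability. The class name, symmetric prior, and interface are illustrative choices rather than the paper's implementation, and the paper's risk-moment approximations are not reproduced here; only the standard conjugate update and Dirichlet moment formulas are used.

```python
import numpy as np

class DirichletTransitionModel:
    """Bayesian model of P(.|q, a): one Dirichlet posterior per state-action pair."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[q, a, q'] are Dirichlet concentration parameters; a symmetric
        # prior of 1.0 corresponds to a uniform prior belief over transitions.
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, q, a, q_next):
        # Conjugate update: observing one transition increments one count.
        self.alpha[q, a, q_next] += 1.0

    def mean(self, q, a):
        # Posterior mean of the transition distribution P(.|q, a).
        a_qa = self.alpha[q, a]
        return a_qa / a_qa.sum()

    def variance(self, q, a):
        # Posterior variance of each transition probability,
        # Var[p_i] = alpha_i (alpha_0 - alpha_i) / (alpha_0^2 (alpha_0 + 1)),
        # which shrinks as more transitions are observed.
        a_qa = self.alpha[q, a]
        a0 = a_qa.sum()
        return a_qa * (a0 - a_qa) / (a0 ** 2 * (a0 + 1.0))
```

The shrinking posterior variance is what lets an agent separate epistemic uncertainty (which decreases with data) from irreducible transition stochasticity when assessing risk.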

2. BACKGROUND

2.1 PROBLEM SETUP

Definition 2.1 A finite MDP with rewards (Sutton et al., 2018) is a tuple M = ⟨Q, A, q_0, P, Re⟩, where Q = {q_1, q_2, ..., q_N} is a finite set of states, A is a finite set of actions, q_0 is (without loss of generality) the initial state, P(q'|q, a) is the probability of transitioning from state q to state q' after taking action a, and Re(q, a) is a real-valued random variable representing the reward obtained after taking action a in state q. A realisation of this random variable (namely a sample, obtained for instance during exploration) will be denoted by re(q, a).

An agent is placed at q_0 ∈ Q at time step t = 0. At every time step t ∈ ℕ_0, the agent selects an action a_t ∈ A, and the environment responds by moving the agent to some new state q_{t+1} according to the transition probability distribution, i.e., q_{t+1} ∼ P(·|q_t, a_t). The environment also assigns the agent a reward re(q_t, a_t). The objective of the agent is to learn how to maximise the long-term reward. In the following we explain these notions more formally.

Definition 2.2 A policy π assigns a distribution over A at each state: π(a|q) is the probability of selecting action a in state q. Given a policy π, we can then define a state-value function

v^π(q) = E_π[ ∑_{t=0}^{∞} γ^t re(q_t, a_t) | q_0 = q ],

where E_π[·] denotes the expected value given that actions are selected according to π, and 0 < γ < 1 is a discount factor. Specifically, this means that the sequence q_0, a_0, q_1, a_1, ... is such that a_n ∼ π(·|q_n) and q_{n+1} ∼ P(·|q_n, a_n). The discount factor γ is a pre-determined hyper-parameter that causes immediate rewards to be worth more than rewards in the future, and that ensures this sum is well-defined under the standard assumption of bounded rewards. The agent's goal is to learn an optimal policy, namely one that maximises the expected discounted return.
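Definitions 2.1 and 2.2 can be made concrete with a small simulation: the sketch below rolls out episodes on an illustrative tabular MDP (all transition probabilities and rewards here are invented for the example) and forms a Monte Carlo estimate of v^π(q_0) under a uniform random policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-state, 2-action MDP: P[q, a, q'] is the transition
# distribution, R[q, a] a deterministic reward (a special case of Re(q, a)).
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.5, 0.5], [0.3, 0.0, 0.7]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 0.1], [0.5, 0.2], [1.0, 0.0]])
gamma = 0.9

def sample_return(policy, q0=0, horizon=200):
    """Roll out one truncated episode under a stochastic policy pi(a|q)
    and return the discounted return sum_t gamma^t re(q_t, a_t)."""
    q, g, discount = q0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=policy[q])      # a_t ~ pi(.|q_t)
        g += discount * R[q, a]
        discount *= gamma
        q = rng.choice(3, p=P[q, a])        # q_{t+1} ~ P(.|q_t, a_t)
    return g

# Monte Carlo estimate of v^pi(q_0) for the uniform random policy.
uniform = np.full((3, 2), 0.5)
v_hat = np.mean([sample_return(uniform) for _ in range(2000)])
```

Since rewards here lie in [0, 1] and γ = 0.9, any such estimate must fall in [0, 10], which gives a quick sanity check on the simulation.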
Maximising the expected discounted return is in fact equivalent to finding a policy that maximises the state-value function v^π(q) at every state (Sutton et al., 2018).

Definition 2.3 A policy π is optimal if, at every state q, v^π(q) = v^*(q) = max_{π'} v^{π'}(q).
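When the MDP of Definition 2.1 is fully known, v^* from Definition 2.3 can be computed by iterating the Bellman optimality operator to its fixed point. The sketch below is a generic value-iteration routine on an invented toy MDP; it is standard background rather than the paper's method, which concerns the case where the transition probabilities are unknown and must be learned.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute v* and a greedy (hence optimal) policy by iterating
    v_{k+1}(q) = max_a [ R(q, a) + gamma * sum_{q'} P(q'|q, a) v_k(q') ]."""
    v = np.zeros(R.shape[0])
    while True:
        # q_vals[q, a] = R(q, a) + gamma * E_{q' ~ P(.|q, a)}[ v(q') ]
        q_vals = R + gamma * np.einsum('qap,p->qa', P, v)
        v_new = q_vals.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q_vals.argmax(axis=1)
        v = v_new

# Illustrative 2-state, 2-action MDP: action 1 in state 1 self-loops
# with reward 2, so v*(q_1) = 2 / (1 - gamma) = 20.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # P[q, a, q']
    [[1.0, 0.0], [0.0, 1.0]],
])
R = np.array([[0.0, 0.0], [1.0, 2.0]])  # R[q, a]
v_star, pi_star = value_iteration(P, R, gamma=0.9)
```

Because the Bellman operator is a γ-contraction, the iteration converges from any initial v; the greedy policy extracted at the fixed point attains v^*(q) at every state, matching Definition 2.3.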

