RISK-AWARE BAYESIAN REINFORCEMENT LEARNING FOR CAUTIOUS EXPLORATION

Abstract

This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that safety constraint violations are bounded at any point during learning. Since enforcing safety during training might limit the agent's exploration, we propose a new architecture that handles the trade-off between efficient progress in exploration and safety maintenance. As the agent's exploration progresses, we update, by means of Bayesian inference, Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the agent's behaviour within the environment. We then propose a way to approximate moments of the agent's belief about the risk associated with the agent's behaviour originating from local action selection. We demonstrate that this approach can be easily coupled with RL, we provide rigorous theoretical guarantees, and we present experimental results to showcase the performance of the overall architecture.

1. INTRODUCTION

Traditionally, RL is principally concerned with the policy that the agent generates by the end of the learning process. In other words, the agent's behaviour during learning is overlooked to the benefit of learning how to behave optimally. Accordingly, many standard RL methods rely on the assumption that the agent selects each available action at every state infinitely often during exploration (Sutton et al., 2018; Puterman, 2014). A related technical assumption that is often made is that the MDP is ergodic, meaning that every state is reachable from every other state under proper action selection (Moldovan & Abbeel, 2012). These assumptions may sometimes be reasonable, e.g., in virtual environments where restarting is always an option. However, in safety-critical systems they might be unreasonable, as we may explicitly require the agent to never visit certain unsafe states. Indeed, in a variety of RL applications the safety of the agent is particularly important, e.g., expensive autonomous platforms or robots that work in proximity of humans. Thus, researchers have recently been paying increasing attention not only to maximising a long-term task-driven reward, but also to enforcing avoidance of unsafe training.

Related Work

The general problem of Safe RL has been an active area of research in which numerous approaches and definitions of safety have been proposed (Brunke et al., 2021; Garcia & Fernandez, 2015; Pecka & Svoboda, 2014). In (Moldovan & Abbeel, 2012), safety is defined in terms of ergodicity, with the goal being that the agent is always able to return to its current state after moving away from it. In (Chow et al., 2018a), safety is pursued by minimising a cost associated with worst-case scenarios, where the cost encodes a lack of safety. Similarly, (Miryoosefi et al., 2019) defines the safety constraint by requiring the expected sum of a vector of measurements to lie in a target set. Other approaches (Li & Belta, 2019; Hasanbeig et al., 2019a; b; 2020; Cai et al., 2021; Hasanbeig et al., 2022) define safety as the satisfaction of temporal logic formulae by the learnt policy, but do not provide safety while training such a policy. Many existing approaches are concerned with providing guarantees on the safety of the learned policy, sometimes under the assumption that a backup policy is available (Coraluppi & Marcus, 1999; Perkins & Barto, 2002; Geibel & Wysotzki, 2005; Mannucci et al., 2017; Chow et al., 2018b; Mao et al., 2019). These methods are applicable to systems that can be trained on accurate simulations, but for many other real-world systems we instead require safety during training. There has also been much research into approaches that maintain safety during training. For instance, (Alshiekh et al., 2017; Jansen et al., 2019; Giacobbe et al., 2021) leverage the concept of a shield that stops the agent from choosing any unsafe actions. Constructing the shield requires the agent to observe the entire MDP (and opponents) in order to build a safety (game) model, which is unavailable for many partially-known MDP tasks.
The approach in (Garcia & Fernandez, 2012) assumes a predefined safe baseline policy that is most likely sub-optimal, and attempts to slowly improve it with a slightly noisy action-selection policy, while defaulting to the baseline policy whenever a measure of safety is exceeded. However, this measure of safety assumes that nearby states have similar safety levels, which may not always be the case. Another common approach is to use expert demonstrations to learn how to behave safely (Abbeel et al., 2010), or even to include an option to default to an expert when the risk is too high (Torrey & Taylor, 2012). Obviously, such approaches rely heavily on the presence and help of an expert, which cannot always be counted upon. Other approaches to this problem (Wen & Topcu, 2018; Cheng et al., 2019; Turchetta et al., 2016) are either computationally expensive or require explicit, strong assumptions about the model of agent-environment interactions. Crucially, maintaining safety in RL by efficiently leveraging available data remains an open problem (Taylor et al., 2021).

Contributions - We tackle the problem of synthesising a policy via RL that optimises a discounted reward, while not violating a safety requirement during learning. This paper puts forward a cautious RL scheme in which the agent maintains a Dirichlet-Categorical model of the MDP. We incorporate higher-order information from the Dirichlet distributions; in particular, we compute approximations of the (co)variances of the risk terms. This allows the agent to reason about the contribution of epistemic uncertainty to the risk level, and therefore to make better-informed decisions about how to stay safe during learning. We show convergence results for these approximations, and propose a novel method to derive an approximate bound on the confidence that the risk is below a certain level.
The new method adds a functionality to the agent that prevents it from taking critically risky actions, and instead leads the agent to take safer actions whenever possible, but otherwise leaves the agent to explore as normal. The proposed method is versatile given that it can be added on to general RL training schemes, in order to maintain safety during learning.

2. BACKGROUND

2.1 PROBLEM SETUP

Definition 2.1 A finite MDP with rewards (Sutton et al., 2018) is a tuple $\mathcal{M} = (Q, A, q_0, P, Re)$, where $Q = \{q_1, q_2, q_3, \ldots, q_N\}$ is a finite set of states, $A$ is a finite set of actions, $q_0$ is (without loss of generality) the initial state, $P(q'|q, a)$ is the probability of transitioning from state $q$ to state $q'$ after taking action $a$, and $Re(q, a)$ is a real-valued random variable which represents the reward obtained after taking action $a$ in state $q$. A realisation of this random variable (namely a sample, obtained for instance during exploration) will be denoted by $re(q, a)$.

An agent is placed at $q_0 \in Q$ at time step $t = 0$. At every time step $t \in \mathbb{N}_0$, the agent selects an action $a_t \in A$, and the environment responds by moving the agent to some new state $q_{t+1}$ according to the transition probability distribution, i.e., $q_{t+1} \sim P(\cdot|q_t, a_t)$. The environment also assigns the agent a reward $re(q_t, a_t)$. The objective of the agent is to learn how to maximise the long-term reward. In the following we formalise these notions.

Definition 2.2 A policy $\pi$ assigns a distribution over $A$ at each state: $\pi(a|q)$ is the probability of selecting action $a$ in state $q$. Given a policy $\pi$, we can then define a state-value function $v_\pi(q) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t re(q_t, a_t) \,\middle|\, q_0 = q\right]$, where $\mathbb{E}_\pi[\cdot]$ denotes the expected value given that actions are selected according to $\pi$, and $0 < \gamma < 1$ is a discount factor. Specifically, this means that the sequence $q_0, a_0, q_1, a_1, \ldots$ is such that $a_n \sim \pi(\cdot|q_n)$ and $q_{n+1} \sim P(\cdot|q_n, a_n)$. The discount factor $\gamma$ is a pre-determined hyper-parameter that makes immediate rewards worth more than rewards in the future, and also ensures that this sum is well-defined, under the standard assumption of bounded rewards. The agent's goal is to learn an optimal policy, namely one that maximises the expected discounted return.
This is equivalent to finding a policy that maximises the state-value function $v_\pi(q)$ at every state (Sutton et al., 2018).

Definition 2.3 A policy $\pi^*$ is optimal if, at every state $q$, $v_{\pi^*}(q) = v^*(q) = \max_\pi v_\pi(q)$.

Definition 2.4 Given a policy $\pi$, we can define a state-action-value function $v_\pi(q, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t re(q_t, a_t) \,\middle|\, q_0 = q, a_0 = a\right]$, similarly to the state-value function. This allows us to rewrite the state-value function as $v_\pi(q) = \sum_a v_\pi(q, a)\,\pi(a|q)$, from which we can see that an optimal deterministic policy $\pi^*$ must assign zero probability to any action $a$ that does not maximise the state-action value function.
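As a concrete illustration of Definitions 2.2-2.4, the following sketch evaluates a fixed policy on a small finite MDP by iterating the Bellman expectation backup. All array names and the function name are our own, purely illustrative, and not taken from the paper.

```python
import numpy as np

# Illustrative policy evaluation for a finite MDP (Q, A, q0, P, Re):
# iterate v(q) <- sum_a pi(a|q) * [Re(q,a) + gamma * sum_q' P(q'|q,a) * v(q')].
def evaluate_policy(P, Re, pi, gamma=0.9, iters=1000):
    """P[q, a, q2]: transition probabilities; Re[q, a]: expected rewards;
    pi[q, a]: action probabilities of the policy being evaluated."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        q_vals = Re + gamma * (P @ v)     # state-action values v_pi(q, a)
        v = (pi * q_vals).sum(axis=1)     # v_pi(q) = sum_a pi(a|q) v_pi(q, a)
    return v

# One-state, one-action MDP with reward 1 and gamma = 0.9:
# the expected discounted return is 1 / (1 - 0.9) = 10.
v = evaluate_policy(np.ones((1, 1, 1)), np.ones((1, 1)), np.ones((1, 1)))
```

The fixed-point of this backup is exactly the state-value function of Definition 2.2.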

2.2. DIRICHLET-CATEGORICAL MODEL OF THE MDP

We consider a model for an MDP with unknown transition probabilities (Ghavamzadeh et al., 2015). The transition probabilities for a given state-action pair are assumed to be described by a categorical distribution over the next state. We maintain a Dirichlet distribution over the possible values of those transition probabilities: since the Dirichlet distribution is conjugate to the categorical distribution, we can employ Bayesian inference to update the Dirichlet distribution as new observations are made while the agent explores the environment. Formally, for each state-action pair $(q_i, a)$, we have a Dirichlet distribution $(p^{i1}_a, p^{i2}_a, \ldots, p^{iN}_a) \sim \mathrm{Dir}(\alpha^{i1}_a, \alpha^{i2}_a, \ldots, \alpha^{iN}_a)$. The random variable $p^{ij}_a$ represents the agent's belief about the transition probability $P(q_j|q_i, a)$. At the start of learning, the agent is assigned a prior Dirichlet distribution for each state-action pair, according to its initial belief about the transition probabilities. At every time step, as the agent moves from some state $q_i$ to some state $q_k$ by taking action $a$, it generates an event $q_i \xrightarrow{a} q_k$, which constitutes new data for the Bayesian inference. From Bayes' rule:
$$\Pr(p^i_a = x^i_a \mid q_i \xrightarrow{a} q_k) \propto \Pr(q_i \xrightarrow{a} q_k \mid p^i_a = x^i_a)\,\Pr(p^i_a = x^i_a) = x^{ik}_a \prod_j (x^{ij}_a)^{\alpha^{ij}_a - 1} = \Big[\prod_{j \neq k} (x^{ij}_a)^{\alpha^{ij}_a - 1}\Big](x^{ik}_a)^{(\alpha^{ik}_a + 1) - 1},$$
which immediately yields $\Pr(p^i_a = x^i_a \mid q_i \xrightarrow{a} q_k) = \mathrm{Dir}(\alpha^{i1}_a, \alpha^{i2}_a, \ldots, \alpha^{ik}_a + 1, \ldots, \alpha^{iN}_a)$. Thus, the posterior distribution is also a Dirichlet distribution. This update is repeated at each time step: the information relevant to the agent's posterior belief about the transition probabilities is the starting prior $\mathrm{Dir}(\alpha^{i1}_a, \ldots, \alpha^{iN}_a)$ together with the "transition counts" $c^{ij}_a$, which keep track of the number of times that $q_i \xrightarrow{a} q_j$ has occurred.
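Because of conjugacy, the whole Bayesian update amounts to incrementing a single pseudo-count. A minimal sketch (the class and method names are ours, purely illustrative):

```python
import numpy as np

# Dirichlet-Categorical belief over the transition probabilities of an MDP.
class DirichletModel:
    """Maintains one Dirichlet distribution per state-action pair,
    over the N possible next states."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[i, a, j]: concentration parameter for transition q_i --a--> q_j.
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def observe(self, i, a, k):
        # Conjugacy: observing q_i --a--> q_k increments alpha[i, a, k] by one.
        self.alpha[i, a, k] += 1.0

model = DirichletModel(n_states=3, n_actions=2)
model.observe(0, 1, 2)   # one observed transition q_0 --a_1--> q_2
```

After the observation, `model.alpha[0, 1, 2]` equals the prior value plus one, while all other parameters are untouched.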
The agent's posterior is then $(p^{i1}_a, p^{i2}_a, \ldots, p^{iN}_a) \sim \mathrm{Dir}(\alpha^{i1}_a + c^{i1}_a, \ldots, \alpha^{iN}_a + c^{iN}_a)$; writing $\alpha^{ij}_a$ for the updated parameters, from this distribution we can distill the expected value $\bar p^{ij}_a$ of each random variable $p^{ij}_a$, as well as the covariance of any two $p^{ij}_a$ and $p^{ik}_a$ (and therefore also the variance of a single $p^{ij}_a$):
$$\bar p^{ij}_a = \mathbb{E}[p^{ij}_a] = \frac{\alpha^{ij}_a}{\alpha^{i0}_a}, \qquad \mathrm{Cov}[p^{ij}_a, p^{ik}_a] = \frac{\alpha^{ij}_a(\delta_{jk}\,\alpha^{i0}_a - \alpha^{ik}_a)}{(\alpha^{i0}_a)^2(\alpha^{i0}_a + 1)},$$
where $\alpha^{i0}_a = \sum_{k=1}^{N} \alpha^{ik}_a$, and $\delta_{jk}$ is the Kronecker delta.
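The posterior moments above can be computed directly from the concentration parameters. A sketch for a single state-action pair (the function name is ours, purely illustrative):

```python
import numpy as np

# Posterior mean and covariance of the Dirichlet belief for one
# state-action pair (q_i, a).
def dirichlet_moments(alpha):
    """alpha: 1-D array of concentration parameters alpha^i1_a ... alpha^iN_a."""
    a0 = alpha.sum()                 # alpha^i0_a
    mean = alpha / a0                # E[p^ij_a] = alpha^ij_a / alpha^i0_a
    # Cov[p^ij_a, p^ik_a] = alpha^ij_a (delta_jk * a0 - alpha^ik_a) / (a0^2 (a0+1))
    cov = (np.diag(alpha) * a0 - np.outer(alpha, alpha)) / (a0**2 * (a0 + 1))
    return mean, cov

mean, cov = dirichlet_moments(np.array([2.0, 1.0, 1.0]))
# mean[0] = 2/4 = 0.5; Var[p^i1_a] = 2*(4-2) / (4^2 * 5) = 0.05
```

Note that each row of the covariance matrix sums to zero, since the components of a Dirichlet-distributed vector sum to one.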

3. RISK-AWARE BAYESIAN RL FOR CAUTIOUS EXPLORATION

In this section we propose a new approach to Safe RL, which specifically addresses the problem of learning an optimal policy in an MDP with rewards while avoiding, during training, certain states classified as unsafe. The agent is assumed to know which states of the MDP are safe and which are unsafe; however, instead of assuming that the agent has this information globally, namely for all states of the MDP, we find it more reasonable that the agent observes states within an area around itself. This closely resembles real-world situations, where systems may have sensors that allow them to detect close-by danger areas, but not necessarily know about danger zones that are far away from them. In particular, we assume that there is an observation "boundary" $O$, such that the agent can observe all states that are reachable from the current state within $O$ steps, and can distinguish which of those states are safe or unsafe. The rest of this section is structured as follows:
1. In Section 3.1, we define the risk $r^m_c(a)$ over $m$ steps of taking an action $a$ at the current state, denoted as $q_c$. We then introduce a random variable $R^m_c(a)$ representing the agent's belief about the risk;
2. In Section 3.2, we leverage a method from (Casella & Berger, 2021) to approximate the expected value and variance of the random variable $R^m_c(a)$. We provide convergence results on the approximations of the expectation and variance of $R^m_c(a)$;
3. In Section 3.3, we show how the Cantelli Inequality (Cantelli, 1929) allows us to estimate a confidence bound on the risk $r^m_c(a)$;
4. In Section 3.4, we prescribe a methodology for incorporating the expectation and variance of risk into the action selection during the training of an RL agent.

3.1. DEFINITION AND CHARACTERISATION OF THE RISK

Given the observation boundary $O$, we reason about the risk incurred over the next $m$ steps after taking a particular action $a$ in the current state $q_c$, for any $m \leq O$. However, note that there is a dependence between the agent's estimate of such a risk and the use of that estimate to inform its action selection policy. In order to resolve this circularity, we sever the dependency between the risk that we calculate and the actions selected that generate that risk, by fixing a policy over the $m$-step horizon and calculating the risk given that policy. Similarly to temporal-difference learning schemes, this is done by assuming best-case action selection: namely, the $m$-step risk $r^m_c(a)$ at state $q_c$ after taking action $a$ is defined assuming that, after selecting action $a$, the agent will select subsequent actions to minimize the expected risk going forward. Assuming that the agent is at state $q_c$, we define the agent's approximation $\bar R^m_c(a)$ of the $m$-step risk by back-propagating the risk given the "expected safest policy" over $m$ steps, as follows:

$$\bar R^0_k = \mathbb{1}(q_k \text{ is observed and unsafe}); \quad (1)$$
$$\bar R^{n+1}_k(a) = \begin{cases} 1 & \text{if } q_k \text{ is observed and unsafe} \\ \sum_{j=1}^{N} \bar p^{kj}_a \bar R^n_j & \text{otherwise}; \end{cases} \quad (2)$$
$$\bar R^{n+1}_k = \begin{cases} 1 & \text{if } q_k \text{ is observed and unsafe} \\ \min_{a \in A} \bar R^{n+1}_k(a) & \text{otherwise}. \end{cases} \quad (3)$$
We terminate this iterative process at $n + 1 = m$, once we have calculated $\bar R^m_c(a)$ (i.e., for $k = c$) for all actions $a$. Note that, despite the use of progressing indices $n$, this is an iterative back-propagation that leverages the expected values of the agent's belief about the transition probabilities, i.e., $\bar p^{kj}_a$. Thus, $\bar R^m_c(a)$ is the agent's approximation of the expectation of the probability of entering an unsafe state within $m$ steps by selecting action $a$ at state $q_c$, and thereafter selecting actions that it currently believes will minimize the probability of entering unsafe states over the given time horizon. The term $\bar p^{kj}_a = \mathbb{E}[p^{kj}_a]$ is used as a point estimate of the true transition probability $t^{kj}_a = P(q_j|q_k, a)$. The value of $\bar R^m_c$ relies only on states which the agent believes are reachable from $q_c$ within $m$ steps. In particular, so long as the horizon $m$ is at most the observation boundary $O$, the agent is able to observe all states relevant to the calculation of $\bar R^m_c(a)$; specifically, $\mathbb{1}(q_j \text{ is unsafe}) = \mathbb{1}(q_j \text{ is observed and unsafe})$ for all relevant states $q_j$ (see Appendix G for more details).
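The back-propagation in equations (1)-(3) can be sketched as follows, assuming the posterior means are stored in an array `p_bar[k, a, j]` (the function name and array layout are our own, purely illustrative):

```python
import numpy as np

# m-step risk recursion: best-case (minimum-risk) action selection
# after the first step, using posterior mean transition probabilities.
def risk(p_bar, unsafe, m, c):
    """Return the vector of R^m_c(a) values, one per action a."""
    R = unsafe.astype(float)                       # R^0_k = 1(q_k unsafe)
    for _ in range(m - 1):
        Ra = p_bar @ R                             # R^{n+1}_k(a) = sum_j p^kj_a R^n_j
        R = np.where(unsafe, 1.0, Ra.min(axis=1))  # R^{n+1}_k = min_a, unless unsafe
    Ra = p_bar[c] @ R                              # final step: keep per-action values
    return np.where(unsafe[c], 1.0, Ra)

# Tiny 3-state example (hypothetical): state 2 is unsafe; from state 1,
# action 0 leads to state 2 while action 1 stays in state 1.
p_bar = np.zeros((3, 2, 3))
p_bar[0, 0, 1] = p_bar[0, 1, 0] = 1.0
p_bar[1, 0, 2] = p_bar[1, 1, 1] = 1.0
p_bar[2, 0, 2] = p_bar[2, 1, 2] = 1.0
unsafe = np.array([False, False, True])
risks = risk(p_bar, unsafe, m=2, c=1)   # action 0 is risky, action 1 is safe
```

In this deterministic toy example, `risks` is `[1.0, 0.0]`: taking action 0 in state 1 certainly enters the unsafe state within 2 steps, while action 1 never does.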

3.2. APPROXIMATION OF EXPECTED VALUE AND COVARIANCE OF THE RISK

Let $x$ denote the vector of variables $x^{ij}_a$, where $i, j$ range from $1$ to $N$ and $a$ ranges over $A$, i.e., $x = (x^{ij}_a)_{i,j=1,\ldots,N,\; a \in A}$. We assume that these indices are ordered lexicographically by $(i, a, j)$: $i$ and $a$ signify a state-action pair $(q_i, a)$, and $j$ signifies a potential next state $q_j$. We introduce a set of functions (which, as we shall see, take the shape of polynomials) $g^n_k[x]$, defined for each state $q_k$ as follows:
$$g^0_k[x] := \mathbb{1}(q_k \text{ is observed and unsafe});$$
$$g^{n+1}_k(a)[x] := \begin{cases} 1 & \text{if } q_k \text{ is observed and unsafe} \\ \sum_{j=1}^{N} x^{kj}_a g^n_j[x] & \text{otherwise}; \end{cases}$$
$$g^{n+1}_k[x] := \begin{cases} 1 & \text{if } q_k \text{ is observed and unsafe} \\ g^{n+1}_k\big(\arg\min_a \bar R^{n+1}_k(a)\big)[x] & \text{otherwise}. \end{cases}$$
Then we can write the risk (of selecting action $a$ in state $q_c$, over $m$ steps) defined above as $r^m_c(a) = g^m_c(a)[t]$, where $t = (t^{ij}_a)_{i,j=1,\ldots,N,\; a \in A}$ is the vector of all "true" transition probabilities $t^{ij}_a := P(q_j|q_i, a)$. We can similarly write the agent's approximation of the risk as $\bar R^m_c(a) = g^m_c(a)[\bar p]$, where similarly $\bar p = (\bar p^{ij}_a)_{i,j=1,\ldots,N,\; a \in A}$. We refer to the actions specified by these argmin operators as the agent's expected safest actions in each state over the next $m$ steps. Now, crucially, we can also define a new random variable $R^m_c(a) = g^m_c(a)[p]$, where $p = (p^{ij}_a)_{i,j=1,\ldots,N,\; a \in A}$. Since the $p^{ij}_a$'s are random variables representing the agent's beliefs about the true transition probabilities $t^{ij}_a$, the random variable $R^m_c(a)$ in fact represents the agent's belief about the true risk $r^m_c(a)$. In the following, we show that $\bar R^m_c(a)$ can be viewed as an approximation of $\mathbb{E}[R^m_c(a)]$, and we provide and justify an approximation of $\mathrm{Var}[R^m_c(a)]$. These approximations can be used by the agent to reason about the true risk $r^m_c(a)$.
In order to construct approximations of the expectation and variance of $R^m_c(a)$, we make use of the first-order Taylor expansion of $g^m_c(a)[x]$ around $x = \bar p$, following a method in (Casella & Berger, 2021). The Taylor expansion is
$$g^m_c(a)[x] = g^m_c(a)[\bar p] + \sum_{i,j=1}^{N} \sum_{b \in A} \frac{\partial g^m_c(a)}{\partial x^{ij}_b}\,(x^{ij}_b - \bar p^{ij}_b) + \text{remainder}, \quad (4)$$
where the partial derivatives are also evaluated at $\bar p$. Now we can turn equation 4 into a statistical approximation by dropping the remainder and substituting the random variables $p$ for $x$, namely:
$$g^m_c(a)[p] \approx g^m_c(a)[\bar p] + \sum_{i,j=1}^{N} \sum_{b \in A} \frac{\partial g^m_c(a)}{\partial x^{ij}_b}\,(p^{ij}_b - \bar p^{ij}_b). \quad (5)$$
We can then take the expectation of both sides, obtaining
$$\mathbb{E}\big[g^m_c(a)[p]\big] \approx g^m_c(a)[\bar p] + \sum_{i,j=1}^{N} \sum_{b \in A} \frac{\partial g^m_c(a)}{\partial x^{ij}_b}\,\mathbb{E}\big[p^{ij}_b - \bar p^{ij}_b\big] = g^m_c(a)[\bar p],$$
where the above steps follow since the only random term on the right-hand side is $p^{ij}_b$, for which $\mathbb{E}[p^{ij}_b] = \bar p^{ij}_b$. Recall that $g^m_c(a)[p] = R^m_c(a)$ and $g^m_c(a)[\bar p] = \bar R^m_c(a)$. Thus, we now have $\bar R^m_c(a)$ as an approximation of the expectation of $R^m_c(a)$. For the approximation of the variance of the agent's believed risk, which is again a random variable, we can write:
$$\mathrm{Var}\big(g^m_c(a)[p]\big) \approx \mathbb{E}\Big[\big(g^m_c(a)[p] - g^m_c(a)[\bar p]\big)^2\Big] \approx \mathbb{E}\Bigg[\Big(\sum_{i,j=1}^{N} \sum_{b \in A} \frac{\partial g^m_c(a)}{\partial x^{ij}_b}\,(p^{ij}_b - \bar p^{ij}_b)\Big)^2\Bigg] \quad \text{(from equation 5)}$$
$$= \sum_{i,j,s,t=1}^{N} \sum_{b_1, b_2 \in A} \frac{\partial g^m_c(a)}{\partial x^{ij}_{b_1}} \frac{\partial g^m_c(a)}{\partial x^{st}_{b_2}}\, \mathrm{Cov}(p^{ij}_{b_1}, p^{st}_{b_2}) = \sum_{i=1}^{N} \sum_{b \in A} \sum_{j,t=1}^{N} \frac{\partial g^m_c(a)}{\partial x^{ij}_b} \frac{\partial g^m_c(a)}{\partial x^{it}_b}\, \mathrm{Cov}(p^{ij}_b, p^{it}_b) =: V^m_c(a),$$
where $V^m_c(a)$ is the approximation for the variance of $R^m_c(a)$, i.e., $V^m_c(a) \approx \mathrm{Var}(R^m_c(a))$, and the last equality follows from the fact that the covariance between two transition probability beliefs $p^{ij}_{b_1}$ and $p^{st}_{b_2}$ is always $0$ unless they correspond to the same starting state-action pair $(q_i, b)$. In other words, $\mathrm{Cov}(p^{ij}_{b_1}, p^{st}_{b_2}) = 0$ unless $i = s$ and $b_1 = b_2$.
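For the special case $m = 1$, the function $g^1_c(a)[x] = \sum_j x^{cj}_a \mathbb{1}(q_j \text{ unsafe})$ is linear in $x$, so the gradient with respect to $x^{cj}_a$ is simply $u_j = \mathbb{1}(q_j \text{ unsafe})$, the Taylor remainder vanishes, and the variance approximation reduces to the quadratic form $u^\top \mathrm{Cov}(p^c_a)\, u$. A sketch under these assumptions (the function name is ours, purely illustrative):

```python
import numpy as np

# One-step (m = 1) risk moments via the delta method: here g is linear,
# so V^1_c(a) = u^T Cov(p^c_a) u with u_j = 1(q_j unsafe).
def one_step_risk_moments(alpha_ca, unsafe):
    """alpha_ca: Dirichlet parameters for (q_c, a); unsafe: boolean mask over states."""
    a0 = alpha_ca.sum()
    mean_p = alpha_ca / a0
    cov = (np.diag(alpha_ca) * a0 - np.outer(alpha_ca, alpha_ca)) / (a0**2 * (a0 + 1))
    u = unsafe.astype(float)
    return u @ mean_p, u @ cov @ u    # (R^1_c(a), V^1_c(a))

r, v = one_step_risk_moments(np.array([2.0, 1.0, 1.0]), np.array([False, False, True]))
# r = 1/4; v = variance of the Beta(1, 3) marginal = 1*3 / (4^2 * 5) = 0.0375
```

In this linear case the approximation is in fact exact, since the aggregated unsafe-mass of a Dirichlet vector has a Beta marginal whose variance matches the quadratic form above.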
Next, we show consistency of the estimate in the limit (see Appendix E for the proof).

Theorem 3.1 Under Q-learning convergence assumptions (Watkins, 1989), namely that reachable state-action pairs are visited infinitely often, the estimate $\bar R^m_c(a)$ of the mean of the believed risk distribution converges to the true risk $r^m_c(a)$, and it does so with the variance of the believed risk distribution $\mathrm{Var}(g^m_c(a)[p])$ approaching its estimate $V^m_c(a)$. Specifically,
$$\frac{\bar R^m_c(a) - r^m_c(a)}{\sqrt{V^m_c(a)}} \to \mathcal{N}(0, 1) \quad \text{in distribution.}$$

3.3. ESTIMATING A CONFIDENCE ON THE APPROXIMATION OF THE RISK

So far we have shown that, when the agent is in the current state $q_c$, for each possible action $a$ it can formally obtain approximations of the expectation and variance of its belief $R^m_c(a)$ about the risk $r^m_c(a)$: we denote these two approximations by $\bar R^m_c(a)$ and $V^m_c(a)$, respectively. We now describe a method for combining these approximations to obtain a bound on the level of confidence that the risk $r^m_c(a)$ is below a certain threshold. We appeal to the Cantelli Inequality, which is a one-sided Chebyshev bound (Cantelli, 1929). Having computed $\bar R^m_c(a)$ and $V^m_c(a)$, for a particular confidence value $0 < C < 1$ we can define
$$P_a := \bar R^m_c(a) + \sqrt{\frac{V^m_c(a)\,C}{1 - C}}.$$
From the Cantelli Inequality we then have $\Pr(R^m_c(a) \leq P_a) \geq C$. Specifically, $P_a$ is the lowest risk level such that, according to its approximations, the agent can be at least $100 \times C\,\%$ confident that the true risk is below level $P_a$. The agent can therefore leverage $P_a$ when attempting to perform safe exploration (please refer to Appendix F for more details).
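Given the two approximations, the bound $P_a$ is a one-liner; a sketch (the function name is ours, purely illustrative):

```python
import math

# Cantelli risk bound: the smallest level P_a such that
# Pr(R <= P_a) >= C, given mean and variance approximations.
def risk_bound(mean_risk, var_risk, confidence):
    assert 0.0 < confidence < 1.0
    return mean_risk + math.sqrt(var_risk * confidence / (1.0 - confidence))

# With mean 0.05, variance 0.0025 and C = 0.9:
# P_a = 0.05 + sqrt(0.0025 * 0.9/0.1) = 0.05 + 0.15 = 0.2
p_a = risk_bound(0.05, 0.0025, 0.9)
```

Substituting the offset $\lambda = \sqrt{V C / (1-C)}$ into Cantelli's inequality $\Pr(X - \mu \geq \lambda) \leq V/(V + \lambda^2)$ gives exactly the tail mass $1 - C$, which is why the confidence guarantee holds.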

3.4. RISK-AWARE BAYESIAN RL FOR CAUTIOUS EXPLORATION (RCRL)

We propose a setup for Safe RL that leverages the expectation and variance of the risk to allow an agent to explore the environment safely, while attempting to learn an optimal policy. In order to pick, at each state, an action that is as close to optimal as possible while remaining safe, we propose a double-learner architecture, referred to as Risk-aware Cautious RL (RCRL) and explained next. The first learner is an optimistic agent that employs Q-learning (QL) to maximize the expected cumulative return. The second learner is a pessimistic agent that maintains a Dirichlet-Categorical model of the transition probabilities of the MDP. In particular, this agent is initialized with a prior $Pri$ that encodes any information the agent might have about the transition probabilities: for each state-action pair $(q_i, a)$ we have a Dirichlet distribution $(p^{i1}_a, p^{i2}_a, \ldots, p^{iN}_a) \sim \mathrm{Dir}(\alpha^{i1}_a, \alpha^{i2}_a, \ldots, \alpha^{iN}_a)$. As the agent explores the environment, the Dirichlet distributions are updated using Bayesian inference. For each action $a$ available in the current state $q_c$, the pessimistic learner computes the approximations $\bar R^m_c(a)$ and $V^m_c(a)$ of its belief $R^m_c(a)$ about the risk over the next $m$ steps of taking action $a$ in $q_c$. The "risk horizon" $m$ is a hyper-parameter that, as discussed, should be set at most as the observation boundary $O$. The pessimistic learner is also initialized with two hyper-parameters, $P_{max}$ and $C(n)$: $P_{max}$ represents the maximum level of risk that the agent should be prepared to take, whereas $C(n)$ is a decreasing function of the number of times $n$ that the current state has been visited, which satisfies $C(0) < 1$ and $\lim_{n \to \infty} C(n) = 0$. From Section 3.3, the agent can then compute, for each action $a$, the value
$$P_a = \bar R^m_c(a) + \sqrt{\frac{V^m_c(a)\,C(n)}{1 - C(n)}},$$
which thus defines a set of safe actions: these are all the actions that the agent believes have risk less than $P_{max}$, with confidence at least $C(n)$, namely $A_{safe} = \{a \in A \mid P_a \leq P_{max}\}$.
In case there are no actions $a$ such that $P_a \leq P_{max}$, the agent instead takes
$$A_{safe} = \{a \in A \mid \bar R^m_c(a) = \min_{a'} \bar R^m_c(a')\}. \quad (8)$$
Finally, the agent selects an action $a^*$ from the set of safe actions using softmax action selection (Sutton et al., 2018) according to the Q-values of those actions, with some temperature $t > 0$:
$$\Pr(a^* = a) = \frac{e^{Q(q_c, a)/t}}{\sum_{a' \in A_{safe}} e^{Q(q_c, a')/t}}. \quad (9)$$
The pseudo-code for the full algorithm is available in Appendix B. In summary, we effectively have two agents learning to accomplish two tasks. The first agent performs Q-learning to learn an optimal policy for the reward. The second agent computes approximations of the expected value and variance of the risk of each action, enabling it to prevent the first agent from selecting actions that it cannot guarantee to be safe enough (with at least a given confidence). When instead the pessimistic agent cannot guarantee that any action is safe enough, it forces the optimistic learner to go into "safety mode", i.e., to forcibly select the actions that minimize the expected value of the risk, as per equation 8. From an empirical perspective, implementing this concept of a "safety mode" allows for continued progress, and pairs extremely well with the definition of the risk: namely, when the agent deems that a state is too risky, it goes into "safety mode" until it is back in a state with sufficiently safe actions. Finally, note that $C(n)$ represents the level of confidence that the agent requires in an action being safe enough for it to consider taking that action. When the agent starts exploring and $C(n)$ is at its highest, the agent only explores actions that it is very confident in. However, it may need to take actions that it is less confident in, in order to find an optimal policy. Thus, as it continues exploring, $C(n)$ is reduced, allowing the agent to select actions in which it is not as confident.
However, in the limit, when $C(n) \to 0$, we have that $P_a = \bar R^m_c(a)$, which means that the agent never takes an action whose approximated expected risk $\bar R^m_c(a)$ exceeds the maximum allowable risk $P_{max}$.
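The RCRL action-selection step described above — Cantelli filtering, the equation 8 fallback ("safety mode"), and softmax over Q-values — can be sketched as follows (all names are illustrative, not from the paper's pseudo-code):

```python
import numpy as np

# One action-selection step of RCRL at the current state.
def select_action(q_values, mean_risk, var_risk, p_max, conf, temp=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Cantelli bound P_a for every action (Section 3.3).
    p_a = mean_risk + np.sqrt(var_risk * conf / (1.0 - conf))
    safe = np.flatnonzero(p_a <= p_max)
    if safe.size == 0:                         # safety mode: min expected risk
        safe = np.flatnonzero(mean_risk == mean_risk.min())
    logits = q_values[safe] / temp
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    return safe[rng.choice(safe.size, p=probs)]

# With zero variance, only the first action clears p_max = 0.1:
a = select_action(np.array([0.0, 10.0]), np.array([0.0, 0.5]), np.zeros(2), 0.1, 0.9)
```

Despite the second action having the higher Q-value, the pessimistic learner filters it out, so `a` is 0; if no action cleared the bound, the softmax would instead run over the minimum-expected-risk actions only.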

4. EXPERIMENTS

Gridworld - We first evaluated the performance of RCRL on a Slippery Gridworld Bridge example. The states of the MDP consist of a 20 × 20 grid, as depicted in Figure 2a (Appendix C). The agent is initialized at $q_0$ in the bottom-left corner (green). The agent's task is to get to the goal region without ever entering an unsafe state. In particular, upon reaching a goal state, the agent is given a reward of 1 and the learning episode is terminated; at every other state it receives a reward of 0, and upon reaching an unsafe state the learning episode terminates with reward 0. At each time step the agent can move into one of the 4 neighbouring states, or stay in its current position; thus, the agent has access to 5 actions at each state, A = {right, up, left, down, stay}. If the agent selects action a ∈ A, then it has a 96% chance of moving in direction a, and a 4% chance of "slipping", namely moving in another random direction. If any movement would take the agent outside of the grid, the agent simply remains in place. The agent is assumed to have an observation boundary of O = 2 steps. Note that, due to the slipperiness of the grid and the narrow passage to reach the goal state, minimizing the risk is not aligned with maximizing the expected reward. We tested RCRL with 5 different combinations of a prior $Pri$ and a maximum acceptable risk $P_{max}$. The following additional hyper-parameters of the algorithm were kept constant: the maximum number of steps per episode, max_steps = 400; the maximum number of episodes, max_episodes = 500 (although this was increased to 1500 in two cases when the agent did not converge to a near-optimal policy within the first 500, cf. Table 1); the learning rate µ = 0.85; the discount factor γ = 0.9; and the risk horizon m = 2 (Appendix B). Recall that a prior consists of a Dirichlet distribution $(p^{i1}_a, \ldots, p^{iN}_a) \sim \mathrm{Dir}(\alpha^{i1}_a, \ldots, \alpha^{iN}_a)$ for every state-action pair $(q_i, a)$.
We considered three priors:
• Prior 1 - completely uninformative: we assigned a value of 1 to every α. This yields a distribution that is uniform over its support.
• Prior 2 - weakly informative: we assigned a value of 12 to the α corresponding to moving in the correct direction, and a value of 1 to all other α's. This gives a distribution between Prior 1 and Prior 3 in both degree of bias and concentration.
• Prior 3 - highly informative: we assigned a value of 96 to the α corresponding to moving in the correct direction, and a value of 1 to all other α's. This gives a distribution that is highly concentrated, and for which the mean values of the transition probability random variables are the true transition probabilities of the MDP, and hence unbiased.
We tested the algorithm with all three priors and a maximum acceptable risk of $P_{max}$ = 0.01, repeating each experiment 10 times to take averages. On average, the agent with the highly informative prior (Prior 3) entered unsafe states 14.4 times, and always converged to near-optimality within about 200 episodes, successfully crossing the bridge 407.4 times. For the other 78.2 episodes, the agent reached the episode limit without crossing the bridge or entering an unsafe state. The agent with Prior 2 interestingly entered unsafe states only an average of 0.5 times per experiment, and converged to a near-optimal policy within about 300 episodes, successfully crossing the bridge 384.6 times. On the other hand, the agent with Prior 1 crossed the bridge fewer than 30 times. We therefore increased the total number of episodes to 1500 and tried again, yet still over half the time it did not converge to a near-optimal policy (Appendix A). We then tested Prior 1 with a more lenient maximum acceptable risk of $P_{max}$ = 0.33, and found that the agent this time managed to converge to near-optimality within around 200 episodes, entering unsafe states 54.2 times and successfully crossing the bridge 404.3 times.
We also tested Prior 3 with a stricter $P_{max}$ = 0.0033 and found that it entered unsafe states only 1.1 times and succeeded 421.3 times, converging to near-optimality within 150 episodes (Appendix A). Finally, we tested native Q-learning, without any safe learning scheme. This native scheme had almost no successful crossings of the bridge in the first 500 episodes, so we ran it for 1500 episodes and found that it converged to a near-optimal policy only about half the time, on average entering unsafe states 990.5 times and successfully crossing the bridge 414.6 times. Table 1 summarizes the number of successes and failures for each agent. To better understand the rate of convergence to near-optimality, Figure 1 (Appendix A) displays the number of steps taken by the agent to cross the bridge at every successful episode (displaying 400 if the agent never managed to cross the bridge), averaged over the 10 experiments. On each graph we display for comparison the theoretical minimum number of steps in which the bridge can be crossed, which is 22. Note that because the grid-world is slippery, even an optimal policy would fluctuate above the 22-step line.

Discussion - The first result of note is how poorly Prior 1 performs with $P_{max}$ = 0.01. It mostly fails to converge to near-optimal behaviour even with 1500 episodes, as can be seen in Figure 1b (Appendix A), in fact seeming to converge more slowly than native Q-learning. This occurs because the maximum allowable risk is set too low for the given prior. In particular, there are two main issues with this. The first issue is a type of degenerate behaviour specific to our algorithm and to the completely uninformative prior with an overly strict $P_{max}$: given that the agent starts with no information on the transition probabilities, it is unable to tell which actions are safe and which are unsafe.
In particular, with $P_{max}$ at 1%, the first time the agent arrives at any state $q_c$ from which it can observe some unsafe state, it immediately goes into safety mode, as it judges that the risk of every action is above 1%. Since it has no information on which action is safest, it randomly selects an action (assuming the Q-values were initialized to 0). If that randomly-selected action does not take the agent closer to a risky state, then, after updating the agent's beliefs about the transition probabilities for that action, it will believe that action is the safest one from that state. Thus, every time it encounters that state again, it will always select that action, never attempting any other actions. This behaviour can be seen in Figure 2b (Appendix C): the state (13, 1) has been visited significantly more often than any other state. This occurred because the first time the agent encountered that state, it chose action stay, and, as above, from then on always chose stay in state (13, 1), causing it to remain in (13, 1) until it slipped off that state. The second issue with having such a strict $P_{max}$ could involve any prior. In this case $P_{max}$ is set so low that actions that may be optimal are simply never tested, as the agent's initial belief about those actions causes their expected risk to always be greater than $P_{max}$. This should not be viewed as an undesirable consequence of the algorithm, but rather as the algorithm working as intended: with the maximum allowable risk level $P_{max}$ set so low, the agent judges that certain actions are riskier than acceptable and therefore does not take them. However, this does raise a more general question about the nature of safe learning: ensuring safety while learning necessarily means avoiding actions we believe are too dangerous, so if we want any guarantees on safety, then we must accept that the agent may be unable to explore the entire state space.
The second result of note is that Prior 3 performs much less safely than Prior 2 does at P max = 0.01. This seems counterintuitive at first, given that Prior 3 is more accurate and more confident than Prior 2. However, the explanation is quite simple. Prior 3 (initially) causes the agent's expected belief to correctly predict that there is only a 1% chance of moving to an unsafe state on a particular step if the agent selects the action to move away from it. On the other hand, Prior 2 causes the agent's expected belief to predict a 6.25% chance of this happening. Thus, Prior 3 (correctly) evaluates the risk of moving within 1 step of a risky state as much lower than Prior 2 does. It is likely that at some points in the experiments, the agent with Prior 3 chose to move within 1 step of an unsafe state where an agent starting with Prior 2 (with the same experiences) would have rejected that action as too risky. The agent with Prior 3 would then be at risk of slipping into an unsafe state. In Figures 2c and 2d (Appendix C), we can see exactly this happening: Prior 3 regularly visits state (13, 8), which is adjacent to the unsafe state (12, 8), whereas Prior 2 regularly moves one more state to the right before moving up to row 13, since (12, 9) is safe. Prior 3 with P max = 0.0033 shows how a highly accurate prior can be used to guarantee even less risk; in this case the agent almost never enters unsafe states, while converging to near-optimality faster than any other setup. The final result is that the rate of convergence of the native Q-learning agent is much slower on this MDP than that of the other agents (excluding Prior 1 with the inappropriate P max = 0.01). As shown in Figure 1 (Appendix A), Q-learning took between 300 and 1500 episodes to converge when it did, and occasionally failed to converge, compared to the 150-300 episodes the four other agents took to converge in all 10 experiments.
This was even the case for the agent with the completely uninformative prior, with P max = 0.33. This is a key result: it shows that not only can RCRL keep the agent safe during learning when possible, it may also direct the agent to explore more fruitful areas of the state space. In this case study in particular, the native Q-learning agent entered unsafe states so often initially that it took many episodes before it was able to access the bridge and find the reward at the other side. Conversely, since the safe agents mostly avoided "sinking" situations, they were able to explore much more of the state space in each episode.

PacMan We also evaluated the performance of RCRL on a PacMan example. Figure 3a (Appendix D) depicts the initial state of the environment, where the agent (PacMan) must get to both yellow dots (food) without getting caught by the ghost. Note that because both the agent and the ghost move through the maze, the PacMan MDP has about 10 times more states than the Gridworld, and up to 5 times more possible next states at any given state. Upon picking up the second piece of food, the agent is given a reward of 1 and the learning episode stops. Every other state incurs a reward of 0, and if the ghost catches PacMan, the learning episode stops with reward 0. The agent has access to four actions at each state, A = {right, up, left, down}, and will move in the direction selected, or, if that direction moves into a wall, it will stay still. The ghost will with 90% probability move in the direction that takes it closest to the agent's next location, and with 10% probability will move in a random direction. For this setup, we assumed an observation boundary O = 3 and compared two values of the risk horizon, m = 2, 3.
We kept the other parameters and hyper-parameters constant: the learning rate µ = 0.85; the discount factor γ = 0.9; the maximum number of steps per episode max steps = 400; the maximum acceptable risk P max = 0.33; the prior, which we set to be a completely uninformative prior as in the Gridworld example; and the maximum number of episodes, which we set as 1500 or the number of episodes before the total rate of successful episodes exceeded 75%. As shown in Table 1, the agent with a risk horizon of m = 2 steps exceeded a success rate of 75% after 311 episodes, having failed 77 times. The agent with the larger risk horizon of m = 3 took only 275 episodes to exceed that success rate, and failed only 68 times. Figures 3b-3c (Appendix D) display the number of steps taken by the agent to win (or 400 if it loses) for each agent, as well as the running average number of steps over the previous 50 episodes.

Discussion The improvement in performance from m = 2 to m = 3 is likely due to the increased foresight of the agent leading it to move away from excessively risky scenarios further in advance, potentially avoiding entering a state from which entering a dangerous state is unavoidable. However, it may also simply be due to the fact that increasing the risk horizon leads to an overall increase in risk estimates, which will naturally cause more actions to be considered too risky and may reduce the number of failures. In other words, we may have been in a situation where decreasing the maximum acceptable risk P max would have led to similar improvements, and the increase in risk horizon was functioning more like a decrease in P max. Both risk-aware agents compare very favourably against the native Q-learning agent, which did not succeed once in 1500 episodes.
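The action-selection rule shared by all the risk-aware agents above (lines (6)-(11) of Algorithm 1, Appendix B) can be sketched as follows. This is a minimal Python illustration, not the paper's code; `risk_mean`, `risk_bound` and `q_values` are hypothetical stand-ins for R̂m c (a), the confidence bound P a, and Q(q c, ·):

```python
# Sketch of RCRL action selection: filter actions by the risk bound P_a,
# fall back to the believed-safest action when nothing clears P_max ("safety
# mode"), then pick greedily by Q among the remaining safe actions.

def select_action(actions, risk_mean, risk_bound, q_values, p_max):
    safe = [a for a in actions if risk_bound[a] <= p_max]
    if not safe:
        # Safety mode: no action is confidently safe, so restrict to the
        # actions whose expected risk is minimal.
        lowest = min(risk_mean[a] for a in actions)
        safe = [a for a in actions if risk_mean[a] == lowest]
    # Greedy choice among safe actions (an exploration scheme such as
    # epsilon-greedy could be substituted here).
    return max(safe, key=lambda a: q_values[a])

actions = ["up", "down", "left", "right"]
risk_bound = {"up": 0.02, "down": 0.40, "left": 0.01, "right": 0.50}
risk_mean  = {"up": 0.01, "down": 0.30, "left": 0.01, "right": 0.45}
q_values   = {"up": 0.2, "down": 0.9, "left": 0.5, "right": 1.0}
print(select_action(actions, risk_mean, risk_bound, q_values, p_max=0.1))  # "left"
```

Note how lowering p_max shrinks the safe set: with these made-up numbers, "right" has the highest Q-value but is always rejected as too risky, mirroring the P max trade-off discussed above.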

B APPENDIX B. RISK-AWARE CAUTIOUS RL -PSEUDO CODE

Algorithm 1: Risk-aware Cautious RL (RCRL)
input: Pri, C(n), P max, max steps, max episodes, µ, γ, m
(1) initialize Q(q, a) for each state-action pair (q, a);
(2) initialize num steps = 0;
(3) initialize num episodes = 0;
while num episodes < max episodes do
    (4) q c ← q 0;
    (5) num episodes ← num episodes + 1;
    while num steps < max steps and q c is not unsafe do
        (6) calculate R̂m c (a) as in (2);
        (7) calculate V m c (a) as in (6);
        (8) calculate P a as in (7);
        (9) A safe := {a ∈ A | P a ≤ P max};
        (10) if A safe = ∅ then
                 A safe ← {a ∈ A | R̂m c (a) = min a' R̂m c (a')};
             end
        (11) choose action a* according to (9);
        (12) pass action a* to the environment and receive next state q' and reward re(q c, a*);
        (13) update belief p̄ as in Section 2;
        (14) update Q(q c, a*) ← (1 − µ)Q(q c, a*) + µ (re(q c, a*) + γ max a' Q(q', a'));
        (15) q c ← q';
        (16) num steps ← num steps + 1;
    end
end

C APPENDIX C. GRIDWORLD EXPERIMENT DETAILS

E APPENDIX E. CONVERGENCE RESULTS FOR THE APPROXIMATIONS OF THE EXPECTED VALUE AND VARIANCE OF THE RISK

Theorem E.1 Under Q-learning convergence assumptions (Watkins, 1989), namely that reachable state-action pairs are visited infinitely often, the estimate of the mean of the believed risk distribution $\hat{R}^m_c(a)$ converges to the true risk $r^m_c(a)$, and it does so with the variance of the believed risk distribution $Var(g^m_c(a)[\bar{p}])$ approaching the estimate of that variance $V^m_c(a)$. Specifically,
$$\frac{\hat{R}^m_c(a) - r^m_c(a)}{\sqrt{V^m_c(a)}} \to \mathcal{N}(0,1) \quad \text{in distribution.}$$
Proof. Let us first rewrite the expressions in equation 6 in vector form, first introducing the following covariance matrix for $\bar{p}$:
$$\Sigma = \begin{pmatrix} Cov(\bar{p}^{11}_{b_1}, \bar{p}^{11}_{b_1}) & Cov(\bar{p}^{11}_{b_1}, \bar{p}^{12}_{b_1}) & \cdots \\ Cov(\bar{p}^{12}_{b_1}, \bar{p}^{11}_{b_1}) & \ddots & \\ \vdots & & Cov(\bar{p}^{NN}_{b_M}, \bar{p}^{NN}_{b_M}) \end{pmatrix}.$$
Recall that the variables $\bar{p}^{ij}_a$ are ordered lexicographically by $(i, a, j)$. Here we write $b_1$ for the first action in $A$ and $b_M$ for the last one, assuming $|A| = M$. Using the matrix $\Sigma$, we can rewrite equation 6 for the approximate variance as
$$Var(R^m_c(a)) \approx (\nabla g^m_c(a)[\bar{p}])^T \, \Sigma \, (\nabla g^m_c(a)[\bar{p}]), \qquad \nabla g^m_c(a)[\bar{p}] = \left( \frac{\partial g^m_c(a)}{\partial x^{11}_{b_1}}, \frac{\partial g^m_c(a)}{\partial x^{12}_{b_1}}, \ldots, \frac{\partial g^m_c(a)}{\partial x^{NN}_{b_M}} \right)^T \bigg|_{x = \bar{p}},$$
where $\nabla g^m_c(a)[\bar{p}]$ is the gradient vector of $g^m_c(a)$ evaluated at $\bar{p}$. In the following, we employ the 'Delta Method' as described in (Casella & Berger, 2021) to allow us to derive a convergence result for the approximations for the mean and variance of $R^m_c(a)$ that we defined above. Let us introduce a semi-vectorised representation of equation 6 where we still leverage the fact that covariances across different state-action pairs are 0, i.e.,
$$\Sigma^i_b = \begin{pmatrix} Cov(\bar{p}^{i1}_b, \bar{p}^{i1}_b) & Cov(\bar{p}^{i1}_b, \bar{p}^{i2}_b) & \cdots \\ Cov(\bar{p}^{i2}_b, \bar{p}^{i1}_b) & Cov(\bar{p}^{i2}_b, \bar{p}^{i2}_b) & \\ \vdots & & Cov(\bar{p}^{iN}_b, \bar{p}^{iN}_b) \end{pmatrix}$$
is the variance-covariance matrix for $(\bar{p}^{ij}_b)_{j = 1, \ldots, N}$. Since $\Sigma$ is built by listing the $\Sigma^i_b$ along the diagonal for $i = 1, \ldots, N$ and $b \in A$, with zeros elsewhere, we have that equation 6 can be rewritten as
$$Var(R^m_c(a)) \approx \sum_{i=1}^N \sum_{b \in A} \left( \nabla^i_b g^m_c(a)[\bar{p}] \right)^T \Sigma^i_b \left( \nabla^i_b g^m_c(a)[\bar{p}] \right).$$
We refer to this approximation for the variance of $R^m_c(a)$ as $V^m_c(a)$ ($\approx Var(R^m_c(a))$).
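The quality of this first-order approximation can be illustrated numerically. The following toy check (our own, with made-up concentration parameters, not from the paper) compares the delta-method variance $\nabla g^T \Sigma \nabla g$ against a Monte Carlo estimate for the simple polynomial $g(p) = p_0 p_1$ of Dirichlet-distributed probabilities, standing in for a small $g^m_c(a)[x]$:

```python
import numpy as np

# Delta-method variance check for g(p) = p0 * p1 with p ~ Dir(alpha).
rng = np.random.default_rng(0)

alpha = np.array([40.0, 30.0, 30.0])          # a fairly concentrated belief
a0 = alpha.sum()
mean = alpha / a0                              # the expected belief p-bar
# Exact Dirichlet covariance matrix: (diag(m) - m m^T) / (a0 + 1).
cov = (np.diag(mean) - np.outer(mean, mean)) / (a0 + 1.0)

g = lambda p: p[0] * p[1]                      # toy risk polynomial
grad = np.array([mean[1], mean[0], 0.0])       # gradient of g at p-bar

v_approx = grad @ cov @ grad                   # delta-method variance estimate
samples = rng.dirichlet(alpha, size=200_000)
v_mc = np.var(samples[:, 0] * samples[:, 1])   # Monte Carlo reference

print(v_approx, v_mc)                          # agree within a few percent
```

For a concentrated belief the two agree closely; the first-order error shrinks as the concentration (i.e., the amount of observed data) grows, which is exactly the regime the theorem addresses.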
Consider the random vector $X = (X^{ij}_a)_{i,j = 1, \ldots, N;\, a \in A}$ (with the previously discussed lexicographic order on the $X^{ij}_a$) where each $(X^{ij}_a)_{j=1}^N$ follows a Categorical distribution with probabilities $t^{ij}_a$, i.e., a realisation of the vector $X$ represents the result of taking one transition from every state-action pair. Wherever $X^{ij}_a = 1$, it represents a transition $q_i \xrightarrow{a} q_j$. $X$ then has mean $t$ and covariances
$$Cov(X^{ij}_a, X^{st}_b) = \begin{cases} t^{ij}_a (1 - t^{ij}_a) & \text{if } i = s,\ a = b,\ j = t \\ -t^{ij}_a \, t^{it}_a & \text{if } i = s,\ a = b,\ j \neq t \\ 0 & \text{otherwise.} \end{cases}$$
We can then write the variance-covariance matrix for $X$ as
$$\Sigma_{XX} = \begin{pmatrix} Cov(X^{11}_{b_1}, X^{11}_{b_1}) & Cov(X^{11}_{b_1}, X^{12}_{b_1}) & \cdots \\ Cov(X^{12}_{b_1}, X^{11}_{b_1}) & Cov(X^{12}_{b_1}, X^{12}_{b_1}) & \\ \vdots & & Cov(X^{NN}_{b_M}, X^{NN}_{b_M}) \end{pmatrix}.$$
If we observe independent random samples $X^{(1)}, X^{(2)}, \ldots, X^{(n)}$ and denote the sample means as $\bar{X}^{ij}_b = \frac{1}{n} \sum_{k=1}^n (X^{ij}_b)^{(k)}$, or $\bar{X} = \frac{1}{n} \sum_{k=1}^n X^{(k)}$, then for the function $g^m_c(a)[x]$ we have
$$g^m_c(a)[\bar{X}] \approx g^m_c(a)[t] + \sum_{i,j=1}^N \sum_{b \in A} \frac{\partial g^m_c(a)}{\partial x^{ij}_b} \left( \bar{X}^{ij}_b - t^{ij}_b \right).$$
This is a direct result of the first-order Taylor expansion around $t$, and therefore the derivatives are evaluated at $t$. In vector notation, we have
$$g^m_c(a)[\bar{X}] \approx g^m_c(a)[t] + (\nabla g^m_c(a)[t])^T (\bar{X} - t).$$
From the 'Multivariate Delta Method' theorem (Casella & Berger, 2021), as long as $\tau^2 := (\nabla g^m_c(a)[t])^T \, \Sigma_{XX} \, (\nabla g^m_c(a)[t]) > 0$, which we will prove later in Lemma 1 and Lemma 2, we have the following convergence:
$$\sqrt{n} \left( g^m_c(a)[\bar{X}] - g^m_c(a)[t] \right) \to \mathcal{N}(0, \tau^2) \quad \text{in distribution.} \qquad (13)$$
Note that this is equivalent to
$$\sqrt{n} \, \frac{g^m_c(a)[\bar{X}] - g^m_c(a)[t]}{\tau} \to \mathcal{N}(0, 1) \quad \text{in distribution,} \qquad (14)$$
where $\tau := \sqrt{\tau^2}$.
In the following we define $\bar{p}^{(n)}$ and $\Sigma^{(n)}$ to be what $\bar{p}$ and $\Sigma$ would have been had the agent started with its prior about the transition probabilities $p$ and then witnessed exactly the transitions represented by the random sample $X^{(1)}, X^{(2)}, \ldots, X^{(n)}$. Formally, suppose that the agent's starting prior was, for each state-action pair $(q_i, b)$, that
$$p^{i1}_b, p^{i2}_b, \ldots, p^{iN}_b \sim Dir(\alpha^{i1}_b, \alpha^{i2}_b, \ldots, \alpha^{iN}_b).$$
Then we can consider the random variables
$$p^{i1(n)}_b, p^{i2(n)}_b, \ldots, p^{iN(n)}_b \sim Dir\left( \alpha^{i1}_b + n\bar{X}^{i1}_b,\ \alpha^{i2}_b + n\bar{X}^{i2}_b,\ \ldots,\ \alpha^{iN}_b + n\bar{X}^{iN}_b \right).$$
Since $n\bar{X}^{ij}_b$ is the count of the number of times $X^{ij}_b$ was 1 in the random sample, this new distribution is exactly the result of performing Bayesian inference on the prior given the random sample as our new data. We then let
$$\bar{p}^{ij(n)}_b := E\left[ p^{ij(n)}_b \right] = \frac{\alpha^{ij}_b + n\bar{X}^{ij}_b}{\sum_{k=1}^N \left( \alpha^{ik}_b + n\bar{X}^{ik}_b \right)},$$
and we also define $\Sigma^{(n)}$ as the covariance matrix of the $p^{ij(n)}_b$ over all $i, j, b$, namely
$$\Sigma^{(n)} = \begin{pmatrix} Cov(p^{11(n)}_{b_1}, p^{11(n)}_{b_1}) & Cov(p^{11(n)}_{b_1}, p^{12(n)}_{b_1}) & \cdots \\ Cov(p^{12(n)}_{b_1}, p^{11(n)}_{b_1}) & Cov(p^{12(n)}_{b_1}, p^{12(n)}_{b_1}) & \\ \vdots & & Cov(p^{NN(n)}_{b_M}, p^{NN(n)}_{b_M}) \end{pmatrix}.$$
From Lemma 1, we have
$$\sqrt{n} \, \frac{g^m_c(a)[\bar{p}^{(n)}] - g^m_c(a)[\bar{X}]}{\tau} \to 0 \quad \text{in probability,}$$
and this allows us to use the well-known Slutsky's Theorem (Slutsky, 1925) on equation 14 and equation 13 to show that
$$\sqrt{n} \, \frac{g^m_c(a)[\bar{p}^{(n)}] - g^m_c(a)[t]}{\tau} \to \mathcal{N}(0, 1) \quad \text{in distribution.} \qquad (15)$$
We must make one more modification to this result. Let
$$(\tau^{(n)})^2 := \left( \nabla g^m_c(a)[\bar{p}^{(n)}] \right)^T \Sigma^{(n)} \left( \nabla g^m_c(a)[\bar{p}^{(n)}] \right).$$
We would like to show that $n (\tau^{(n)})^2 \to \tau^2$ in probability. To do this, first note that $\bar{p}^{(n)} \to t$ in probability, so since $g^m_c(a)$ has continuous derivatives we have that $\nabla g^m_c(a)[\bar{p}^{(n)}] \to \nabla g^m_c(a)[t]$ in probability. Next we note that $n \Sigma^{(n)} \to \Sigma_{XX}$ in probability.
This is because for the $((i, b_1, j), (s, b_2, t))$-entry we have $0 \to 0$ if $i \neq s$ or $b_1 \neq b_2$, and otherwise (taking $j \neq t$; the diagonal entries are analogous) we have
$$n \, Cov\left( p^{ij(n)}_b, p^{it(n)}_b \right) = \frac{-n \left( \alpha^{ij}_b + n\bar{X}^{ij}_b \right) \left( \alpha^{it}_b + n\bar{X}^{it}_b \right)}{\left( \sum_{k=1}^N (\alpha^{ik}_b + n\bar{X}^{ik}_b) \right)^2 \left( 1 + \sum_{k=1}^N (\alpha^{ik}_b + n\bar{X}^{ik}_b) \right)} = \frac{-n \left( \alpha^{ij}_b + n\bar{X}^{ij}_b \right) \left( \alpha^{it}_b + n\bar{X}^{it}_b \right)}{\left( n + \sum_{k=1}^N \alpha^{ik}_b \right)^2 \left( n + 1 + \sum_{k=1}^N \alpha^{ik}_b \right)} \to -t^{ij}_b t^{it}_b = Cov(X^{ij}_b, X^{it}_b).$$
Therefore we have that the products converge in probability:
$$n (\tau^{(n)})^2 = \left( \nabla g^m_c(a)[\bar{p}^{(n)}] \right)^T n \Sigma^{(n)} \left( \nabla g^m_c(a)[\bar{p}^{(n)}] \right) \to (\nabla g^m_c(a)[t])^T \, \Sigma_{XX} \, (\nabla g^m_c(a)[t]) = \tau^2.$$
Since $\tau^2$ is always positive, and the square root function is therefore continuous at $\tau^2$, we have that $\sqrt{n} \, \tau^{(n)} \to \tau$, and so $\frac{\tau}{\sqrt{n} \, \tau^{(n)}} \to 1$ in probability. Now we can finally apply Slutsky's Theorem to obtain our final result, which is
$$\frac{g^m_c(a)[\bar{p}^{(n)}] - g^m_c(a)[t]}{\tau^{(n)}} \to \mathcal{N}(0, 1) \quad \text{in distribution.} \qquad (16)$$
Recall that $g^m_c(a)[t]$ is the actual risk in the current state $q_c$, $g^m_c(a)[\bar{p}^{(n)}]$ is the agent's approximation of the expectation of the risk given its beliefs, and $(\tau^{(n)})^2$ is the agent's approximation of the variance of the risk given its beliefs (both, in this case, assuming it has seen exactly $n$ transitions from each state-action pair). So indeed our estimate of the mean of the believed risk distribution converges to the true risk with enough data, and it does so with the variance of the believed risk distribution approaching our estimate of that variance.

Lemma 1 Given the definition of the polynomial $g^m_c(a)[x]$, we have the following:
$$\sqrt{n} \, \frac{g^m_c(a)[\bar{p}^{(n)}] - g^m_c(a)[\bar{X}]}{\tau} \to 0 \quad \text{in probability.}$$
Proof. As required for the convergence results in Theorem 3.1, one can see that all of the coefficients in $g^m_c(a)[x]$ are either 0 or 1. This means that we can rewrite it as a sum of terms of the form $\prod_{i,j,b} \left( x^{ij}_b \right)^{n^{ij}_b}$ for exponents $n^{ij}_b$, and hence we can write $\sqrt{n} \left( g^m_c(a)[\bar{p}^{(n)}] - g^m_c(a)[\bar{X}] \right) / \tau$ as a sum of terms of the form
$$\frac{\sqrt{n}}{\tau} \left( \prod_{i,j,b} \left( \bar{p}^{ij(n)}_b \right)^{n^{ij}_b} - \prod_{i,j,b} \left( \bar{X}^{ij}_b \right)^{n^{ij}_b} \right).$$
Substituting the definition of $\bar{p}^{ij(n)}_b$ into this expression yields
$$\frac{\sqrt{n}}{\tau} \left( \prod_{i,j,b} \left( \frac{\alpha^{ij}_b + n\bar{X}^{ij}_b}{\sum_{k=1}^N \left( \alpha^{ik}_b + n\bar{X}^{ik}_b \right)} \right)^{n^{ij}_b} - \prod_{i,j,b} \left( \bar{X}^{ij}_b \right)^{n^{ij}_b} \right),$$
and we can simplify this by leveraging that $\sum_{k=1}^N n\bar{X}^{ik}_b = n$, to get
$$\frac{\sqrt{n}}{\tau} \left( \prod_{i,j,b} \left( \frac{\alpha^{ij}_b + n\bar{X}^{ij}_b}{n + \sum_{k=1}^N \alpha^{ik}_b} \right)^{n^{ij}_b} - \prod_{i,j,b} \left( \bar{X}^{ij}_b \right)^{n^{ij}_b} \right).$$
Now, the $\alpha^{ij}_b$ are constants, as is $\tau$, and the values of $\bar{X}^{ij}_b$ are all bounded between 0 and 1. Thus, to show that this expression converges to 0 in probability, we will write it as one quotient and show that some term in the denominator dominates all terms in the numerator. Let $M := \sum_{i,j,b} n^{ij}_b$. The expression above is equal to
$$\frac{\sqrt{n}}{\tau} \left( \frac{\prod_{i,j,b} \left( \alpha^{ij}_b + n\bar{X}^{ij}_b \right)^{n^{ij}_b} - \prod_{i,j,b} \left( \bar{X}^{ij}_b \right)^{n^{ij}_b} \prod_{i,j,b} \left( n + \sum_{k=1}^N \alpha^{ik}_b \right)^{n^{ij}_b}}{\prod_{i,j,b} \left( n + \sum_{k=1}^N \alpha^{ik}_b \right)^{n^{ij}_b}} \right).$$
Now, in the numerator of the inner quotient there are only two terms of order $n^M$. One is $\prod_{i,j,b} \left( n\bar{X}^{ij}_b \right)^{n^{ij}_b}$, coming from the expansion of the product on the left.

For the proof of Lemma 2, the matrix $P^{m-1}[x]$ has entries
$$\left( P^{m-1}[x] \right)_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } q_i \text{ is unsafe and observed} \\ 0 & \text{if } i \neq j \text{ and } q_i \text{ is unsafe and observed} \\ x^{ij}_a & \text{otherwise,} \end{cases}$$
where $g_0$ is the vector with entries $(g_0)_k := \mathbb{1}(q_k \text{ is observed and unsafe})$. We can now define the vectors $A^n$ for $n \leq m$ by
$$A^n_i := \begin{cases} \left( (P^{n-1}[t])(P^{n-2}[t]) \cdots (P^0[t]) \, g_0 \right)_i & \text{if } q_i \text{ is safely reachable from } q_c \text{ in exactly } m - n \text{ steps} \\ 0 & \text{otherwise,} \end{cases}$$
where in this case a state $q_{s_n}$ is defined to be safely reachable from the current state $q_{s_0} = q_c$ in exactly $n$ steps if
• there are states $q_{s_1}, q_{s_2}, \ldots, q_{s_{n-1}}$ such that each $t^{s_k s_{k+1}}_{b_{s_k}} > 0$, and
• the states $q_{s_1}, q_{s_2}, \ldots, q_{s_{n-1}}$ are all safe (note that $q_{s_n}$ can still be unsafe).
The purpose of these $A^n$ is just to restrict our attention to the states at step $n$ of the backpropagation that actually influence $g^m_c(a)[t]$. It is easy to see that multiplying the remaining matrices back onto $A^n$ recovers the risk at $q_c$. We will now argue that if $g^m_c(a)[t]$ is not equal to 0 or 1, there are states $q_i, q_j, q_l$ and an action $b$ such that $t^{ij}_b$ and $t^{il}_b$ are both non-zero (so there is a positive probability of the events $X^{ij}_b = 1$ and $X^{il}_b = 1$) and such that the derivatives of $g^m_c(a)$ with respect to $x^{ij}_b$ and $x^{il}_b$ differ at $t$. So assume that $g^m_c(a)[t]$ is not equal to 0 or 1.
Let $n_0$ be the largest index such that $A^{n_0}$ contains an entry $(A^{n_0})_l$ that is equal to 0 and such that $q_l$ is safely reachable from $q_c$ in exactly $m - n_0$ steps, meaning that $(A^{n_0})_l$ is a 0 that came from $\left( (P^{n_0-1}[t]) \cdots (P^0[t]) \, g_0 \right)_l$. Since $g^m_c(a)[t]$ is not 0, $n_0 < m$, and since $q_l$ is safely reachable in $m - n_0$ steps, let $q_c = q_{s_0}, q_{s_1}, \ldots, q_{s_{m-n_0}} = q_l$ be a path along which $q_l$ is safely reachable. Then let $q_i := q_{s_{m-n_0-1}}$, and we have that $q_i$ is safe and $t^{il}_{b_{s_{m-n_0-1}}} > 0$. For brevity, write $b := b_{s_{m-n_0-1}}$. Now, since $q_i$ is safely reachable in $m - (n_0 + 1)$ steps, $(A^{n_0+1})_i$ cannot be equal to 0 (since $n_0$ was maximal), so there must be some state $q_j$ such that $t^{ij}_b > 0$ and $A^{n_0}_j > 0$ (in order for the term $t^{ij}_b A^{n_0}_j$ to contribute some positive value to $(A^{n_0+1})_i$). Finally, let $p_1$ be the probability of safely entering $q_i$ in $m - (n_0 + 1)$ steps (i.e., the sum over all paths that safely reach $q_i$ of the probability of taking that path by choosing the actions specified by the agent's expected safest policy). Then, by the chain rule, $\frac{\partial g^m_c(a)}{\partial x^{ij}_b} \big|_{x=t} > 0$. On the other hand, if increasing the value of $t^{il}_b$ could increase the value of $(A^{n_0})_l = \left( (P^{n_0-1}[t]) \cdots (P^0[t]) \, g_0 \right)_l$ from 0 to greater than 0, then $t^{il}_b$ must have been 0, since $\left( (P^{n_0-1}[t]) \cdots (P^0[t]) \, g_0 \right)_l$ is a sum of products of values from $t$, all of which are non-negative; as we have established that $t^{il}_b > 0$, this cannot happen, and so $\frac{\partial g^m_c(a)}{\partial x^{il}_b} \big|_{x=t} = 0$. Hence we have found states $q_i, q_j, q_l$ and an action $b$ such that the derivatives of $g^m_c(a)$ with respect to $x^{ij}_b$ and $x^{il}_b$ differ at $t$, while $t^{ij}_b$ and $t^{il}_b$ are both positive. The only detail left to note is that we assumed that $g^m_c(a)[t]$ is not equal to 0 or 1. This assumption is reasonable to make, because if it did not hold, then either our agent would be doomed to enter an unsafe state within $m$ steps, or there would be no chance of entering an unsafe state within $m$ steps, according to the agent's expected safest actions. Since what matters to us is how the agent manages risk, situations involving risk 1 or risk 0 are irrelevant.
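The step $n \Sigma^{(n)} \to \Sigma_{XX}$ used in the proof of Theorem E.1 is easy to check numerically. Below is a minimal sketch (our own, with made-up transition probabilities and a uniform prior): the posterior Dirichlet covariance, scaled by $n$ and evaluated at the limiting sample means, approaches the covariance of a single Categorical draw:

```python
import numpy as np

# Check that n * Cov(Dir(alpha + n*Xbar)) approaches diag(t) - t t^T,
# the covariance of one Categorical draw with true probabilities t.
t = np.array([0.7, 0.2, 0.1])            # hypothetical true transition probs
alpha = np.array([1.0, 1.0, 1.0])        # uninformative prior

def scaled_posterior_cov(n):
    """n times the covariance of Dir(alpha + n*Xbar), with Xbar at its limit t."""
    a = alpha + n * t                    # posterior concentration parameters
    a0 = a.sum()
    m = a / a0
    return n * (np.diag(m) - np.outer(m, m)) / (a0 + 1.0)

sigma_xx = np.diag(t) - np.outer(t, t)   # covariance of one Categorical draw
err = np.abs(scaled_posterior_cov(10**6) - sigma_xx).max()
print(err)                               # vanishingly small
```

The off-diagonal entries converge to the $-t^{ij}_b t^{it}_b$ terms derived above, and the diagonal entries to $t^{ij}_b (1 - t^{ij}_b)$.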

F APPENDIX F. CONFIDENCE BOUND ON THE RISK

To estimate a confidence bound on the risk, we appeal to the Cantelli inequality, a one-sided Chebyshev bound (Cantelli, 1929), which states that for a real-valued random variable $R$ with expectation $E[R]$ and variance $Var[R]$, for $\lambda > 0$ we have
$$Pr(R \leq E[R] + \lambda) \geq 1 - \frac{Var[R]}{Var[R] + \lambda^2}.$$
If we let $C := 1 - \frac{Var[R]}{Var[R] + \lambda^2}$, then rearranging we get that $\lambda = \sqrt{\frac{Var[R] \, C}{1 - C}}$. Thus, for a variable $R$ that represents some sort of risk, and for some value of $0 < C < 1$, we can say
$$Pr(R \leq P) \geq C, \quad \text{where } P := E[R] + \sqrt{\frac{Var[R] \, C}{1 - C}}.$$
In words, "there is at least a $C$ chance that the risk is at most $P$." Alternatively, "we are at least $100C\%$ confident that the risk is at most $P$."
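The bound is a one-liner in practice. A minimal sketch, with illustrative numbers of our own choosing: given the agent's estimates of the mean and variance of the risk, Cantelli's inequality yields the level-C upper bound P used as P a in Algorithm 1:

```python
import math

def risk_upper_bound(mean, var, confidence):
    """P = E[R] + sqrt(Var[R] * C / (1 - C)); Pr(R <= P) >= C by Cantelli."""
    assert 0.0 < confidence < 1.0 and var >= 0.0
    return mean + math.sqrt(var * confidence / (1.0 - confidence))

# E.g. a believed risk with mean 1% and standard deviation 2%, at 95% confidence:
p = risk_upper_bound(0.01, 0.02**2, 0.95)
print(round(p, 4))  # 0.0972: the agent treats this action as ~9.7% risky
```

Note how heavily the variance penalizes uncertain beliefs: even with a 1% expected risk, an agent that is unsure of its estimate must budget nearly 10% risk at the 95% confidence level, which is what drives cautious behaviour early in learning.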



Figure 1: The number of steps it takes the agent to cross the bridge for every episode where it crosses. Averaged over 10 experiments. Results for Q-learning only and for RCRL across different priors and values of risk P max . As Q-learning converges, it approaches the lower bound on the optimal number of steps per episode.

Figure 2: (a) Slippery Gridworld setup: agent is represented by an arrow surrounded by the observation area (white line). Labels denote target (yellow), unsafe (red) and safe states (blue), and initial state (q 0 , green). (b) For a single experiment, number of state-visitations for Prior 1 at P max = 0.01. (c-d) Number of state-visitations, for Priors 2 and 3 at P max = 0.01.

where $\nabla^i_b g^m_c(a)[\bar{p}]$ is the gradient vector $\nabla g^m_c(a)[\bar{p}]$ restricted to the entries $\frac{\partial g^m_c(a)}{\partial x^{ij}_b}$ for $j = 1, \ldots, N$.


The other term of order $n^M$ is $n^M \prod_{i,j,b} \left( \bar{X}^{ij}_b \right)^{n^{ij}_b}$, coming from the product on the right, and these two terms cancel each other out. This means the numerator is entirely of order $n^{M-1}$ or less. On the other hand, the denominator of the inner quotient contains a term of order $n^M$. (In the matrices $P^n[x]$, the action $b^i_n := \arg\min_b \hat{R}^n_i(b)$ is the agent's believed-safest action at state $q_i$ over horizon $n$; $P^{m-1}[x]$ is defined analogously, with entries $x^{ij}_a$.)

Then the $P^n[x]$ represent the transition probabilities used in the calculation of $g^m_c(a)[x]$. Specifically, we have that
• $g^n_k[x]$ is the $k$th entry of the vector $(P^{n-1}[x]) \cdots (P^0[x]) \, g_0$ for $n < m$;
• $g^m_k(a)[x]$ is the $k$th entry of the vector $(P^{m-1}[x])(P^{m-2}[x]) \cdots (P^0[x]) \, g_0$;
• so the risk at the current state $q_c$, $g^m_c(a)[t]$, is the $c$th entry of the vector $(P^{m-1}[t])(P^{m-2}[t]) \cdots (P^0[t]) \, g_0$.

(Along such a path, the actions are $b_{s_0} = a$ and $b_{s_k} := \arg\min_b \hat{R}^{m-k-1}_{s_k}(b)$ for $k \geq 1$, as determined by the agent's expected safest policy.)

$$\left( (P^{m-1}[t]) \cdots (P^n[t]) \, A^n \right)_c = g^m_c(a)[t] \quad \text{for every } n = 0, 1, \ldots, m.$$

$$\frac{\partial g^m_c(a)}{\partial x^{ij}_b} \bigg|_{x=t} = p_1 \times A^{n_0}_j + t^{ij}_b \times \frac{\partial \left( (P^{n_0-1}[x]) \cdots (P^0[x]) \, g_0 \right)_j}{\partial x^{ij}_b} \bigg|_{x=t} > 0,$$
since clearly $\frac{\partial \left( (P^{n_0-1}[x]) \cdots (P^0[x]) \, g_0 \right)_j}{\partial x^{ij}_b} \big|_{x=t}$ cannot be negative. On the other hand, $\frac{\partial \left( (P^{n_0-1}[x]) \cdots (P^0[x]) \, g_0 \right)_l}{\partial x^{il}_b} \big|_{x=t}$ can be nonzero only if increasing the value of $t^{il}_b$ could increase $\left( (P^{n_0-1}[t]) \cdots (P^0[t]) \, g_0 \right)_l$ above 0.

Table 1: Total successes and failures. Gridworld: different priors and acceptable risks P max, averaged over 10 agents. PacMan: varying risk horizon m, single agent.


The dominating term in the denominator is of order $n^M$. Therefore, even after multiplying by the $\frac{\sqrt{n}}{\tau}$ on the outside, which means the highest-order term in the numerator could be as high as $n^{M - \frac{1}{2}}$, the $n^M$ in the denominator still dominates and the expression as a whole converges to 0 in probability. Since $\sqrt{n} \left( g^m_c(a)[\bar{p}^{(n)}] - g^m_c(a)[\bar{X}] \right) / \tau$ was a sum of expressions of that form, and they all converge to 0 in probability, we get the result we desired.

Lemma 2 The quantity $\tau^2 = (\nabla g^m_c(a)[t])^T \, \Sigma_{XX} \, (\nabla g^m_c(a)[t])$ is strictly greater than zero, namely $\tau^2 > 0$.

Proof.

Note that the covariance matrix can be written as $\Sigma_{XX} = E[(X - t)(X - t)^T]$ (recall $t$ is the mean vector for $X$). So we have
$$\tau^2 = (\nabla g^m_c(a)[t])^T \, E[(X - t)(X - t)^T] \, (\nabla g^m_c(a)[t]) = E[s^2],$$
where we note that $s := (\nabla g^m_c(a)[t])^T (X - t)$ is a real-valued random variable, so $s^T = s$. Thus, to prove $\tau^2 > 0$ we simply have to show that $s \neq 0$ for some value of $X$ that occurs with non-zero probability. Writing $s^i_b$ for the contribution to $s$ of the entries indexed by the state-action pair $(q_i, b)$, we have $s = \sum_{(q_i, b)} s^i_b$. We need to show that there is some possible value of $X$ such that $s \neq 0$. Now, the value of $X$ is determined by the values of $X^i_b := (X^{ij}_b)_{j=1}^N$ for each state-action pair $(q_i, b)$. Furthermore, these $X^i_b$ are independent, and the value of $s^i_b$ depends only on the value of $X^i_b$. So if there is some state-action pair $(q_i, b)$ such that two possible values of $X^i_b$ yield two distinct values of $s^i_b$, both with nonzero probability, then we can fix the values of the $X^{hj}_{b'}$ for all $j$ and all $(h, b') \neq (i, b)$ to be some values that occur with non-zero probability, which would fix the value of $s - s^i_b$, and so we could use our two distinct values of $s^i_b$ to find two distinct values of $s$. Both cannot be 0, so we would be done. Now, the value of $X^i_b$ is characterized by picking one $j$ such that $X^{ij}_b = 1$ and setting all other $X^{il}_b = 0$ for $l \neq j$. This means that to find two different values of some $s^i_b$, we just need to find states $q_i, q_j, q_l$ and an action $b$ such that the derivatives $\frac{\partial g^m_c(a)}{\partial x^{ij}_b} \big|_{x=t}$ and $\frac{\partial g^m_c(a)}{\partial x^{il}_b} \big|_{x=t}$ differ. So long as the events $X^{ij}_b = 1$ and $X^{il}_b = 1$ both have nonzero probability, we would be done. In order to show that such states $q_i, q_j, q_l$ and such an action $b$ exist, we must introduce vectors $A^n$ that will effectively keep track of each state's contribution towards $g^m_c(a)[t]$ at the $n$th step of the risk backpropagation.
First, define the $N$-by-$N$ matrix $P^n[x]$ for $n = 0, 1, \ldots, m-2$ such that
$$\left( P^n[x] \right)_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } q_i \text{ is unsafe and observed} \\ 0 & \text{if } i \neq j \text{ and } q_i \text{ is unsafe and observed} \\ x^{ij}_{b^i_n} & \text{otherwise.} \end{cases}$$

To understand what exactly $\hat{R}^m_c(a)$ is an approximation of, consider instead calculating this risk using the true transition probabilities $t^{kj}_a$. We would get
$$r^0_k := \mathbb{1}(q_k \text{ is observed and unsafe}). \qquad (18)$$
Note that we crucially still take the minimum-risk action $a$ according to the agent's approximation $\hat{R}^{n+1}_k(a)$. In this case, the term $r^m_c(a)$ is the true probability of entering an unsafe state after selecting action $a$ in the agent's current state $q_c$ and thereafter selecting the actions that the agent currently believes will minimize the probability of entering an unsafe state over the horizon $m$. $\hat{R}^m_c(a)$ is the agent's approximation of $r^m_c(a)$. We will later justify the use of $\hat{R}^m_c(a)$ as an approximation of $r^m_c(a)$, but for now let us consider why it makes sense to define the $m$-step risk as $r^m_c(a)$. This is because the action $a$ that minimizes believed risk is the action that the agent would choose if it were trying to behave as safely as possible, what we will call going into 'safety mode'. Consider the motivating example of a pilot learning to fly a remote-control helicopter by incrementally expanding the set of actions they feel safe taking. They start by generating just enough lift to begin flying, then immediately land back down again. They repeat this process a few times until they feel that they have a good understanding of how the helicopter responds to this limited range of inputs. Then they take a risk (by either flying a bit higher or attempting to move horizontally) and once again immediately land. As they repeat this process of taking small risks and landing to remain safe, they begin to expand their comfort zone. At some point after taking a risk, they will feel comfortable just coming back to a hovering position rather than landing, once they have become confident that they can hover safely.
This suggests that a natural process for learning to operate in the face of risk is to repeatedly take small risks, each followed by going into safety mode until back in a confidently safe state. Thus, when calculating how risky an action is, it makes sense to consider the probability of entering an unsafe state given that, after the action, the agent enters safety mode; $r^m_c(a)$ does exactly this. As mentioned earlier, the other reason for defining the risk $r^m_c(a)$ in this way is that it makes it possible for the agent to calculate the risk without having to reason about the inter-dependency between the calculated risk and the agent's future actions. However, it does more than this. We will see in the next section that it in fact allows the agent to view $\hat{R}^m_c(a)$ as (an approximation of) the expected value of a random variable for the believed risk, whose variance we can also approximate, allowing for deeper reasoning about action selection for Safe RL.
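The backpropagated risk $r^m_c(a)$ described above can be computed directly as a sequence of matrix-vector products. The following toy example is our own construction (a hypothetical 3-state chain where state 2 is unsafe), not taken from the paper's experiments:

```python
import numpy as np

# Toy m-step risk: start from g_0 (indicator of observed unsafe states),
# backpropagate one step at a time taking the minimum-risk action at each
# safe state, with unsafe states absorbing (risk fixed at 1).
t = {  # t[action][i, j]: true transition probabilities per action
    "stay":    np.array([[0.9, 0.1, 0.0],
                         [0.1, 0.8, 0.1],
                         [0.0, 0.0, 1.0]]),
    "advance": np.array([[0.1, 0.8, 0.1],
                         [0.0, 0.2, 0.8],
                         [0.0, 0.0, 1.0]]),
}
unsafe = np.array([False, False, True])
g0 = unsafe.astype(float)                      # g_0: indicator of unsafe states

def step_risk(g_next):
    """One backpropagation step: each safe state takes its min-risk action;
    unsafe states keep risk 1 (absorbing rows of the P matrices)."""
    risks = np.stack([t[a] @ g_next for a in t])  # risk per action, per state
    g = risks.min(axis=0)                         # safest action at each state
    g[unsafe] = 1.0
    return g

def m_step_risk(c, a, m):
    """Risk of taking action a at state c, then acting safest for m-1 steps."""
    g = g0
    for _ in range(m - 1):
        g = step_risk(g)
    return float(t[a][c] @ g)

print(m_step_risk(0, "stay", 2), m_step_risk(0, "advance", 2))
```

With these made-up numbers, "advance" from state 0 is far riskier over two steps than "stay", so a safety-mode agent at state 0 would choose "stay"; this mirrors how $r^m_c(a)$ separates the queried first action from the believed-safest continuation.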

