OBSERVATIONAL ROBUSTNESS AND INVARIANCES IN REINFORCEMENT LEARNING VIA LEXICOGRAPHIC OBJECTIVES

Abstract

Policy robustness in Reinforcement Learning (RL) may not be desirable at any price; the alterations caused by robustness requirements from otherwise optimal policies should be explainable and quantifiable. Policy gradient algorithms that have strong convergence guarantees are usually modified to obtain robust policies in ways that do not preserve algorithm guarantees, which defeats the purpose of formal robustness requirements. In this work we study a notion of robustness in partially observable MDPs where state observations are perturbed by a noise-induced stochastic kernel. We characterise the set of policies that are maximally robust by analysing how the policies are altered by this kernel. We then establish a connection between such robust policies and certain properties of the noise kernel, as well as with structural properties of the underlying MDPs, constructing sufficient conditions for policy robustness. We use these notions to propose a robustness-inducing scheme, applicable to any policy gradient algorithm, to formally trade off the reward achieved by a policy with its robustness level through lexicographic optimisation, which preserves convergence properties of the original algorithm. We test the proposed approach on safety-critical RL environments, and show how the proposed method helps achieve high robustness in observational noise problems.

1. INTRODUCTION

Robustness in Reinforcement Learning (RL) (Morimoto & Doya, 2005) can be looked at from different perspectives: (1) distributional shifts in the training data with respect to the deployment stage (Satia & Lave Jr, 1973; Heger, 1994; Nilim & El Ghaoui, 2005; Xu & Mannor, 2006); (2) uncertainty in the model or observations (Pinto et al., 2017; Everett et al., 2021); (3) adversarial attacks against actions (Pattanaik et al., 2017; Fischer et al., 2019); and (4) sensitivity of neural networks (used as policy or value function approximators) towards input disturbances (Kos & Song, 2017; Huang et al., 2017). Robustness does not naturally emerge in most RL settings, since agents are typically trained only in a single, unchanging environment. There is a trade-off between how robust a policy is and how close it is to the set of optimal policies in its training environment, and in safety-critical applications we may need to provide formal guarantees for this trade-off.

Motivation. Consider a complex dynamical system where we need to synthesise a controller (policy) through a model-free approach. When using a simulator for training, we expect the deployment of the controller in the real system to be affected by different sources of noise, possibly not predictable or modelled (e.g. for networked components we may have sensor faults, communication delays, etc.). In safety-critical systems, robustness (in terms of successfully controlling the system under disturbances) should preserve formal guarantees. Considerable effort has gone into developing formal convergence guarantees for policy gradient algorithms (Agarwal et al., 2021; Bhandari & Russo, 2019), but these guarantees vanish when "robustifying" policies through regularisation or adversarial approaches.
Therefore, for such applications one would need a scheme to regulate the robustness-utility trade-off in RL policies that preserves the formal guarantees of the original algorithms and retains sub-optimality conditions for the original problem.

Lexicographic Reinforcement Learning. Recently, lexicographic optimisation (Isermann, 1982; Rentmeesters et al., 1996) has been applied to the multi-objective RL setting (Skalse et al., 2022b). In a Lexicographic RL (LRL) setting with different reward-maximising objective functions $\{K_i\}_{1 \le i \le n}$, some objectives may be more important than others, and so we may want to obtain policies that solve the multi-objective problem in a lexicographically prioritised way, i.e., "find the policies that optimise objective $i$ (reasonably well), and from those the ones that optimise objective $i + 1$ (reasonably well), and so on". There exist both value-based and policy-based algorithms for LRL, and the approach is broadly applicable to (most) state-of-the-art RL algorithms (Skalse et al., 2022b).

Previous Work. In robustness against model uncertainty, the MDP may have noisy or uncertain reward signals or transition probabilities, as well as possible resulting distributional shifts in the training data (Heger, 1994; Xu & Mannor, 2006; Fu et al., 2018; Pattanaik et al., 2018; Pirotta et al., 2013; Abdullah et al., 2019), which connects to ideas on distributionally robust optimisation (Wiesemann et al., 2014; Van Parys et al., 2015). One of the first examples is Heger (1994), where the author proposes using minimax approaches to learn Q functions that minimise the worst-case total discounted cost in a general MDP setting. Derman et al. (2020) propose a Bayesian approach to deal with uncertainty in the transitions.
Another robustness sub-problem concerns adversarial attacks or disturbances acting on policies or action selection in RL agents (Gleave et al., 2020; Lin et al., 2017; Tessler et al., 2019; Pan et al., 2019; Tan et al., 2020; Klima et al., 2019). Recently, Gleave et al. (2020) proposed that, instead of modifying observations, one could attack RL agents by swapping the policy for an adversarial one at given times. For a detailed review on Robust RL see Moos et al. (2022). Our work focuses on the study of robustness against observational disturbances, where agents observe a disturbed state measurement and use it as input for the policy (Kos & Song, 2017; Huang et al., 2017; Behzadan & Munir, 2017; Mandlekar et al., 2017; Zhang et al., 2020; 2021). In particular, Mandlekar et al. (2017) consider both random and adversarial state perturbations, and introduce physically plausible generation of disturbances in the training of RL agents that makes the resulting policy robust towards realistic disturbances. Zhang et al. (2020) propose a state-adversarial MDP framework, and utilise adversarial regularising terms that can be added to different deep RL algorithms to make the resulting policies more robust to observational disturbances, minimising the distance bound between disturbed and undisturbed policies through convex relaxations of neural networks to obtain robustness guarantees. Zhang et al. (2021) study how LSTM increases robustness with optimal state-perturbing adversaries.

1.1. MAIN CONTRIBUTIONS

Most existing work on RL with observational disturbances proposes modifying RL algorithms (learning to deal with perturbations through linear combinations of regularising loss terms or adversarial terms) that come at the cost of explainability (in terms of sub-optimality bounds) and verifiability, since the induced changes in the new policies result in a loss of convergence guarantees. Our main contributions are summarised in the following points.

Structure of Robust Policy Sets

We consider general unknown stochastic disturbances and formulate a quantitative definition of observational robustness that allows us to characterise the sets of robust policies for any MDP in the form of operator-invariant sets. We analyse how the structure of these sets depends on the MDP and noise kernel, and obtain an inclusion relation (the Inclusion Theorem, Section 3) that provides intuition into how we can search for robust policies more effectively.

Verifiable Robustness through LRL. While LRL is developed for reward-maximising objectives, through the proposed observational robustness definition we can cast robustness as a lexicographic objective, allowing us to retain policy optimality up to a specified tolerance while maximising robustness and yielding a mechanism to formally control the performance-robustness trade-off. This preserves convergence guarantees of the original algorithm and yields formal bounds on policy sub-optimality. We provide numerical examples of how this logic is applied to existing policy gradient algorithms, compare with existing algorithms in previous work, and verify how the Inclusion Theorem helps to induce more robust policies while retaining algorithm optimality.

Figure 1: Qualitative representation of the proposed LRPG algorithm, compared to usual robustness-inducing algorithms. (a) PG algorithms when robustness terms are added to the cost function indiscriminately. (b) In LRPG, the policy is guaranteed (up to the original algorithm used) to converge to an $\epsilon$-ball of $\Pi^*$, $\{\pi \in \Pi : J^* - J(\pi) \le \epsilon\}$, and from those, the most robust ones. The sets in blue are the maximally robust policies to be defined in the coming sections. Through LRPG we guarantee that the policies will only deviate a bounded distance from the original objective, and induce a search for robustness in the resulting valid policy set.
Figure 1 represents a qualitative interpretation of the results in this work (the structure of the robust sets will become clear in following sections).

1.2. PRELIMINARIES

Notation. We use calligraphic letters $\mathcal{A}$ for collections of sets and $\Delta(\mathcal{A})$ as the space of probability measures over $\mathcal{A}$. For two elements of a vector space we use $\langle \cdot, \cdot \rangle$ as the inner product. We use $\mathbf{1}_n$ as a column vector of size $n$ that has all entries equal to 1. We say that an MDP is ergodic if for any policy the resulting Markov Chain (MC) is ergodic. We say that $S$ is an $n \times n$ row-stochastic matrix if $S_{ij} \ge 0$ and each row of $S$ sums to 1. We use $\alpha, \beta, \eta$ for learning rates, $\mu$ for probability distributions and $\theta \in \Theta$ for parameters.

Lexicographic Reinforcement Learning. We propose using policy-based LRL (PB-LRL) to encode the idea that, when learning how to solve an RL task, robustness is important but not at any price, i.e., we would like to solve the original objective reasonably well, and from those policies efficiently find the most robust one. Consider a parameterised policy $\pi_\theta$ with $\theta \in \Theta$, and two objective functions $K_1$ and $K_2$. PB-LRL uses a multi-timescale optimisation scheme to optimise $\theta$ faster for higher-priority objectives, iteratively updating the constraints induced by these priorities and encoding them via Lagrangian relaxation techniques (Bertsekas, 1997). Let $\theta' \in \arg\max_\theta K_1(\theta)$. Then, PB-LRL can be used to find parameters

$$\theta'' = \arg\max_\theta K_2(\theta) \quad \text{such that} \quad K_1(\theta) \ge K_1(\theta') - \epsilon.$$

This is done by computing the (estimated) gradient ascent update

$$\theta \leftarrow \mathrm{proj}_\Theta\big[\theta + \nabla_\theta \hat{K}(\theta)\big], \qquad \lambda \leftarrow \mathrm{proj}_{\mathbb{R}_{\ge 0}}\big[\lambda + \eta_t\,(\hat{k}_1 - \epsilon_t - \hat{K}_1(\theta))\big], \tag{1}$$

where $\hat{K}(\theta) := (\beta^1_t + \lambda \beta^2_t) \cdot \hat{K}_1(\theta) + \beta^2_t \cdot \hat{K}_2(\theta)$, $\lambda$ is a Lagrange multiplier, $\beta^1_t, \beta^2_t, \eta_t$ are learning rates, and $\hat{k}_1$ is an estimate of $K_1(\theta')$. Typically, we set $\epsilon_t \to 0$, though we can use other tolerances too, e.g., $\epsilon_t = 0.9 \cdot \hat{k}_1$. For more detail on the convergence proofs and particularities of PB-LRL we refer the reader to Skalse et al. (2022b).
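As an illustration of update (1), the following is a minimal sketch of a single PB-LRL step for two objectives. It assumes that the gradient estimates and the value estimates are supplied by the caller, and that the projection onto $\Theta$ is the identity; the function name `pb_lrl_step` and its argument names are hypothetical and are not taken from Skalse et al. (2022b).

```python
def pb_lrl_step(theta, lam, grad_k1, grad_k2, k1_value, k1_best,
                beta1, beta2, eta, eps):
    """One lexicographic update for objectives K1 (priority) and K2.

    theta, grad_k1 and grad_k2 are equal-length lists of floats.
    Assumption: the projection onto Theta is the identity.
    """
    # Gradient of the combined objective (beta1 + lam * beta2) * K1 + beta2 * K2.
    w1 = beta1 + lam * beta2
    new_theta = [t + w1 * g1 + beta2 * g2
                 for t, g1, g2 in zip(theta, grad_k1, grad_k2)]
    # The multiplier grows while K1 falls short of (k1_best - eps);
    # max(0, .) is the projection onto the non-negative reals.
    new_lam = max(0.0, lam + eta * (k1_best - eps - k1_value))
    return new_theta, new_lam
```

When the current estimate of $K_1$ lies below the tolerance band, the multiplier increases and the next step weights the primary objective more heavily; once the constraint is satisfied, the multiplier decays towards zero.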

2. OBSERVATIONALLY ROBUST REINFORCEMENT LEARNING

We restrict the robustness problem considered in this work to the following version of a noise-induced partially observable Markov Decision Process (Spaan, 2012).

Definition 1. An observationally-disturbed MDP (DOMDP) is a POMDP defined by the tuple $(X, U, P, R, T, \gamma)$ where $X$ is a finite set of states, $U$ is a set of actions, $P : U \times X \to \Delta(X)$ is a probability measure of the transitions between states and $R : X \times U \times X \to \mathbb{R}$ is a reward function. The map $T : X \to \Delta(X)$ is a stochastic kernel induced by some unknown noise signal, such that $T(y \mid x)$ is the probability of measuring $y$ while the true state is $x$, and acts only on the state observations. Finally, $\gamma \in [0, 1]$ is a reward discount factor.

In a DOMDP agents can measure the full state, but the measurement will be disturbed by some unknown random signal in the policy roll-out phase. Unlike in the POMDP setting, the agent has access to the true state $x$ during learning of the policies (the simulator is noise-free), and has no information about the noise kernel $T$ or a way to estimate it. The difficulty of acting in such a DOMDP is that the transitions are actually undisturbed and a function of the true state $x$, but agents will have to act based on disturbed states $\tilde{x} \sim T(\cdot \mid x)$. We then need to construct policies that will be as robust as possible against noise without being able to construct noise estimates. This is a setting that reflects many control problems; we can design a controller for ideal noiseless conditions, and we know that at deployment there will likely be noise, data corruption, adversarial perturbations, etc., but we do not have certainty on the disturbance structure.

Remark 1. In Sections 2 and 3 we reason about the influence of $T$ in the characterisation of robustness and robust policies. However, when trying to learn robust policies we will need to introduce uncertainty in the training phase. How to add this uncertainty will become clear in further sections.
A (memoryless) policy for the agent is a stochastic kernel $\pi : X \to \Delta(U)$. For simplicity, we overload notation on $\pi$, denoting by $\pi(x, u)$ the probability of taking action $u$ at state $x$ under the stochastic policy $\pi$ in the MDP, i.e., $\pi(x, u) = \Pr\{u \mid x\}$. The value function of a policy $\pi$, $V^\pi : X \to \mathbb{R}$, is given by

$$V^\pi(x_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(x_t, \pi(x_t), x_{t+1})\Big].$$

The action-value function of $\pi$ (Q-function) is given by $Q^\pi(x, u) = \sum_{y \in X} P(x, u, y)\big(R(x, u, y) + \gamma V^\pi(y)\big)$. It is well known that, under mild conditions (Sutton & Barto, 2018), the optimal value function can be obtained by means of the Bellman equation $V^*(x) := \max_u \sum_{y \in X} P(x, u, y)\big(R(x, u, y) + \gamma V^*(y)\big)$, and an optimal policy is guaranteed to exist such that $\pi^*(x) := \arg\max_\pi V^\pi(x)$ for all $x \in X$. We then define the objective function as $J(\pi) := \mathbb{E}_{x_0 \sim \mu_0}[V^\pi(x_0)]$ with $\mu_0$ being a distribution of initial states, and we use $J^* := \max_\pi J(\pi)$. If a policy is parameterised by $\theta \in \Theta$ we write $\pi_\theta$ and $J(\theta)$.

Assumption 1. For any DOMDP and policy $\pi$, the resulting MC is irreducible and aperiodic.

Assumption 1 ensures that for any DOMDP and policy $\pi$, there exists a stationary probability distribution of states $\mu_\pi \in \Delta(X)$, and for every policy and state this probability is larger than zero. We now formalise a notion of observational robustness. Firstly, due to the presence of the stochastic kernel $T$, the policy we are applying is altered, as we are applying a collection of actions in a possibly wrong state. This behaviour can be formally captured by

$$\Pr\{u \mid x, \pi, T\} = \langle \pi, T \rangle(x, u) := \sum_{y \in X} T(y \mid x)\,\pi(y, u),$$

where $\langle \pi, T \rangle : X \to \Delta(U)$ is the disturbed policy, which averages the current policy given the error induced by the presence of the stochastic kernel. Notice that $\langle \cdot, T \rangle(x) : \Pi \to \Delta(U)$ is an averaging operator yielding the alteration of the policy due to noise. We can then define the robustness regret:

$$\rho(\pi, T) := J(\pi) - J(\langle \pi, T \rangle). \tag{3}$$

Definition 2 (Policy Robustness).
We say that a policy $\pi$ is $\kappa$-robust against a stochastic kernel $T$ if $\rho(\pi, T) \le \kappa$. If $\pi$ is 0-robust we say it is maximally robust. We define the sets of $\kappa$-robust policies, $\Pi_\kappa := \{\pi \in \Pi : \rho(\pi, T) \le \kappa\}$, with $\Pi_0$ being the set of maximally robust policies.

One can motivate the characterisation and models above from a control perspective, where policies use as input discretised state measurements with possible sensor measurement errors. Formally ensuring robustness properties when learning RL policies will, in general, force the resulting policies to deviate from optimality in the undisturbed MDP. With this motivation, we propose solving the problem of increasing robustness of RL policies through a hierarchical lexicographic approach, which naturally incorporates trade-offs during the policy design. The first objective is to minimise the distance $J^* - J(\pi)$ up to some tolerance. Then, from the policies that satisfy this constraint, we want to steer the learning algorithm towards a maximally robust policy according to the metric defined in Definition 2. This can be formulated as the following problem, to be solved by means of LRL, casting robustness as a valid lexicographic objective.

Problem 1. For a DOMDP and a given tolerance level $\epsilon$, derive a policy $\pi^\epsilon$ that satisfies $J^* - J(\pi^\epsilon) \le \epsilon$ as a prioritised objective and is as robust as possible according to Definition 2.
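To make the disturbed policy and the robustness regret of Definition 2 concrete, here is a minimal tabular sketch. It assumes that $P$, $R$, $T$ and the policy are given as nested lists and that the value function is computed by plain iterative policy evaluation; the two-state MDP, the helper names and the number of sweeps are hypothetical choices made for illustration only.

```python
def disturbed_policy(pi, T):
    """Return <pi, T>, with <pi, T>[x][u] = sum_y T[x][y] * pi[y][u]."""
    n, m = len(pi), len(pi[0])
    return [[sum(T[x][y] * pi[y][u] for y in range(n)) for u in range(m)]
            for x in range(n)]


def policy_values(pi, P, R, gamma, sweeps=2000):
    """Iterative policy evaluation for tabular P[x][u][y] and R[x][u][y]."""
    n, m = len(pi), len(pi[0])
    V = [0.0] * n
    for _ in range(sweeps):
        V = [sum(pi[x][u] * sum(P[x][u][y] * (R[x][u][y] + gamma * V[y])
                                for y in range(n))
                 for u in range(m))
             for x in range(n)]
    return V


def robustness_regret(pi, T, P, R, gamma, mu0):
    """rho(pi, T) = J(pi) - J(<pi, T>), with J the mu0-weighted value."""
    def J(policy):
        values = policy_values(policy, P, R, gamma)
        return sum(w * v for w, v in zip(mu0, values))
    return J(pi) - J(disturbed_policy(pi, T))


# Hypothetical two-state example: action 0 stays, action 1 switches state,
# and the reward is 1 whenever the next state is state 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
R = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
T_uniform = [[0.5, 0.5], [0.5, 0.5]]
greedy = [[0.0, 1.0], [1.0, 0.0]]    # always move to, or stay in, state 1
constant = [[0.3, 0.7], [0.3, 0.7]]  # same action distribution in every state

regret_greedy = robustness_regret(greedy, T_uniform, P, R, 0.9, [1.0, 0.0])
regret_constant = robustness_regret(constant, T_uniform, P, R, 0.9, [1.0, 0.0])
```

In this toy example the state-dependent policy loses value once its observations are scrambled, while the constant policy is unchanged by the kernel and therefore incurs no regret.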

3. CHARACTERISATION OF ROBUST POLICIES

An important question to be addressed, before trying to synthesise robust policies through LRL, is what these robust policies look like, and how they are related to DOMDP properties. The robustness notion in Definition 2 is intuitive and it allows us to classify policies. We begin by exploring which types of policies are maximally robust, starting with the set of constant policies and the set of fixed points of the operator $\langle \cdot, T \rangle$, whose formal descriptions are now provided.

Definition 3. A policy $\pi : X \to \Delta(U)$ is said to be constant if $\pi(x) = \pi(y)$ for all $x, y \in X$, and the collection of all constant policies is denoted by $\bar{\Pi}$. A policy $\pi : X \to \Delta(U)$ is called a fixed point of the operator $\langle \cdot, T \rangle$ if $\pi(x) = \langle \pi, T \rangle(x)$ for all $x \in X$. The collection of all fixed points will be denoted by $\Pi_T$.

In other words, a constant policy is any policy that yields the same action distribution for any state, and a fixed point policy is any policy whose action distributions are unaltered by the noise kernel. Observe furthermore that $\Pi_T$ only depends on the kernel $T$ and the set $X$. We now present a proposition that links the two sets of policies in Definition 3 with our notion of robustness.

Proposition 1. Consider a DOMDP as in Definition 1, the robustness notion given in Definition 2 and the concepts in Definition 3. Then we have that $\bar{\Pi} \subseteq \Pi_T \subseteq \Pi_0$.

The importance of Proposition 1 is that it allows us to produce (approximately) maximally robust policies by computing the distance of a policy to either the set of constant policies or the set of fixed points of the operator $\langle \cdot, T \rangle$, and this is at the core of the construction in Section 4. However, before this, let us introduce another set that is sandwiched between $\Pi_T$ and $\Pi_0$. Let us assume we have a policy iteration algorithm that employs an action-value function $Q^\pi$ and policy $\pi$.
The advantage function for $\pi$ is defined as $A^\pi(x, u) := Q^\pi(x, u) - V^\pi(x)$ and can be used as a maximisation objective to learn optimal policies (as in, e.g., A2C (Sutton et al., 1999), A3C (Mnih et al., 2016)). We can similarly define the noise disadvantage (a form of negative advantage) of policy $\pi$ as

$$D^\pi(x, T) := V^\pi(x) - \mathbb{E}_{u \sim \langle \pi, T \rangle(x)}[Q^\pi(x, u)],$$

which measures the difference between applying at state $x$ an action according to the policy $\pi$, and playing an action according to $\langle \pi, T \rangle$ and then continuing to play according to $\pi$. Our intuition says that if $D^\pi(x, T) = 0$ for all states in the DOMDP, then such a policy is maximally robust. This is indeed the case, as shown in the next proposition.

Proposition 2. Consider a DOMDP as in Definition 1 and the robustness notion as in Definition 2. If a policy $\pi$ is such that $D^\pi(x, T) = 0$ for all $x \in X$, then $\pi$ is maximally robust. That is, letting $\Pi_D := \{\pi \in \Pi : \mu_\pi(x) D^\pi(x, T) = 0 \ \forall x \in X\}$, we have that $\Pi_D \subseteq \Pi_0$.

So far we have shown that both the set of fixed points $\Pi_T$ and the set of policies for which the disadvantage function is equal to zero, $\Pi_D$, are contained in the set of maximally robust policies. More interesting is the fact that the inclusion established in Proposition 1 and the one in Proposition 2 can be linked in a natural way. We call this connection, which is the main result of this section, the Inclusion Theorem.

Theorem 1 (Inclusion Theorem). For a DOMDP with noise kernel $T$, consider the set of constant policies $\bar{\Pi}$ and the sets $\Pi_T$, $\Pi_D$ and $\Pi_0$. Then, the following inclusion relation holds:

$$\bar{\Pi} \subseteq \Pi_T \subseteq \Pi_D \subseteq \Pi_0.$$

Additionally, the sets $\bar{\Pi}$, $\Pi_T$ are convex for all MDPs and kernels $T$, but $\Pi_D$, $\Pi_0$ may not be.

Let us reflect on the inclusion relations of Theorem 1.
The inclusions are in general not strict, and in fact the geometry of the sets (as well as whether some of the relations are in fact equalities) is highly dependent on the reward function, and in particular on the complexity (from an information-theoretic perspective) of the reward function. As an intuition, less complex reward functions (more uniform) will make the inclusions above expand to the entire policy set, and more complex reward functions will make the relations collapse to equalities. The following corollary illustrates this.

Corollary 1. For any ergodic DOMDP there exist reward functions $\underline{R}$ and $\overline{R}$ such that the resulting DOMDP satisfies: (i) $\Pi_D = \Pi_0 = \Pi$ (any policy is maximally robust) if $R = \underline{R}$; (ii) $\Pi_T = \Pi_D = \Pi_0$ (only fixed point policies are maximally robust) if $R = \overline{R}$.

We can now summarise the insights from Theorem 1 and Corollary 1 in the following conclusions: (1) the set of constant policies $\bar{\Pi}$ is maximally robust, convex and independent of the DOMDP; (2) the set $\Pi_T$ is maximally robust, convex, includes $\bar{\Pi}$, and its properties only depend on $T$; (3) the set $\Pi_D$ includes $\Pi_T$ and is maximally robust, but its properties depend on the DOMDP.
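The noise disadvantage and the role of the reward function can be checked numerically on a small example. The sketch below assumes a tabular DOMDP stored as nested lists and uses iterative policy evaluation; the two-state MDP and all function names are hypothetical. It evaluates $D^\pi(x, T)$ for a state-dependent policy under a non-uniform reward, for the same policy under a constant reward (the uniform-reward intuition behind Corollary 1), and for a constant policy.

```python
def q_values(V, P, R, gamma):
    """Q[x][u] = sum_y P[x][u][y] * (R[x][u][y] + gamma * V[y])."""
    n, m = len(P), len(P[0])
    return [[sum(P[x][u][y] * (R[x][u][y] + gamma * V[y]) for y in range(n))
             for u in range(m)] for x in range(n)]


def evaluate(pi, P, R, gamma, sweeps=2000):
    """Iterative policy evaluation; returns (V, Q) for the policy pi."""
    V = [0.0] * len(pi)
    for _ in range(sweeps):
        Q = q_values(V, P, R, gamma)
        V = [sum(p * q for p, q in zip(pi[x], Q[x])) for x in range(len(pi))]
    return V, q_values(V, P, R, gamma)


def noise_disadvantage(pi, T, P, R, gamma):
    """D[x] = V(x) - E_{u ~ <pi, T>(x)}[Q(x, u)] for every state x."""
    V, Q = evaluate(pi, P, R, gamma)
    n, m = len(pi), len(pi[0])
    D = []
    for x in range(n):
        noisy = [sum(T[x][y] * pi[y][u] for y in range(n)) for u in range(m)]
        D.append(V[x] - sum(noisy[u] * Q[x][u] for u in range(m)))
    return D


# Hypothetical two-state example: action 0 stays, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
R_goal = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]   # reward at state 1
R_const = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]  # uniform reward
T_uniform = [[0.5, 0.5], [0.5, 0.5]]
greedy = [[0.0, 1.0], [1.0, 0.0]]
constant = [[0.3, 0.7], [0.3, 0.7]]

d_greedy_goal = noise_disadvantage(greedy, T_uniform, P, R_goal, 0.9)
d_greedy_const = noise_disadvantage(greedy, T_uniform, P, R_const, 0.9)
d_constant_goal = noise_disadvantage(constant, T_uniform, P, R_goal, 0.9)
```

Under the uniform reward every action has the same value, so the disadvantage vanishes for any policy; under the goal reward only the constant policy keeps a zero disadvantage in this example.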

4. ROBUSTNESS THROUGH LEXICOGRAPHIC OBJECTIVES

We have now characterised robustness in a DOMDP and explored the relation between the sets of policies that are robust according to the definition proposed. We have seen in the Inclusion Theorem that several classes of policies are maximally robust, and our goal now is to connect these results with lexicographic optimisation. To be able to apply LRL results to our robustness problem we need to first cast robustness as a valid objective to be maximised, and then show that a stochastic gradient descent approach would indeed find a global maximum of the objective, therefore yielding a maximally robust policy. Then, this robustness objective can be combined with a primary reward-maximising objective $K_1(\theta) = \mathbb{E}_{x_0 \sim \mu_0}[V^{\pi_\theta}(x_0)]$ and any algorithm with certified convergence to produce a solution to Problem 1.

We do not know $T$. In the introduction, we emphasised that the motivation for this work came partially from the fact that we may not know $T$ in reality, or have a way to estimate it. However, the theoretical results until now depend on $T$. Our proposed solution to this lies in the results of Theorem 1. If we use an assumed generator $\tilde{T}$ with the smallest possible fixed point set (i.e. the constant policy set), the robustness lexicographic objective will drive the policy towards the set of fixed points of $\tilde{T}$, which will be included in the fixed points of $T$ (from Theorem 1). We argue that this is reasonable since we expect that it will improve robustness for any noise structure. If we do have information about the noise generator, it may be sensible to pick a different $\tilde{T}$. For details on how $\tilde{T}$ may relate to $T$ see Appendix B.3. A reasonable choice for the assumed stochastic kernel $\tilde{T}$ is the uniform kernel, following the Principle of Maximum Entropy (when no information about $T$ is available, we consider the maximum entropy distribution). In specific problems, other priors, adversarial noise, etc., may be more appropriate.
We now propose a valid lexicographic objective for which a minimising solution yields a maximally robust policy. One of the messages of the Inclusion Theorem is the fact that fixed points and constant policies are maximally robust, the latter being completely oblivious to the choice of $T$, a relevant feature for the design of robust policies. Consider the optimisation problem

$$\min_\theta \; K_T(\theta) = \sum_{x \in X} \mu_{\pi_\theta}(x)\, \tfrac{1}{2} \big\| \pi_\theta(x) - \langle \pi_\theta, T \rangle(x) \big\|_2^2, \tag{5}$$

where we recall that $\pi_\theta$ is a given parameterisation of the set of policies. Notice that problem (5) projects the current policy onto the set of fixed points of the operator $\langle \cdot, T \rangle$, and due to Assumption 1, which requires $\mu_{\pi_\theta}(x) > 0$ for all $x \in X$, the optimal value is equal to zero if and only if there exists a value of the parameter $\theta$ for which the corresponding $\pi_\theta$ is a fixed point of $\langle \cdot, T \rangle$. In practice, the objectives are computed for a batch of trajectory-sampled states $X_s \subset X$, and averaged with weight $1/|X_s|$; we denote these approximations with a hat. By applying standard stochastic approximation arguments, we can prove that an SGD iteration using

$$\nabla_\theta \hat{K}_T(\theta)(x) = \big(\pi_\theta(x) - \pi_\theta(y)\big) \nabla_\theta \pi_\theta(x), \qquad y \sim T(\cdot \mid x),$$

(which is an unbiased estimator for the objective) is guaranteed to converge to the optimal solution of problem (5). For details and proof, see Appendix B.

Remark 2. The gradient approximation $\nabla_\theta \hat{K}_T(\theta)(x)$ is not the true gradient of $K_T$. However, this approximation is sufficient to ensure convergence of the policy $\pi_\theta$ to a fixed point of the operator $\langle \cdot, T \rangle$, provided we have a fully parameterised policy. Such an approximation is also easy to compute from sampled points $x \in X$ both on- and off-policy. Other types of policy parameterisations may also yield a fixed point of $\langle \cdot, T \rangle$ if they allow us to make the policy state-independent, i.e., if there is a parameter $\theta$ for which $\pi_\theta(x) = \pi_\theta(y)$ for all $x, y \in X$.
This is the case, for example, when considering general neural network architectures if we set the weights to zero (but not the bias).

Assumption 2 (Learning Rates). We assume all learning rates $\alpha_t(x, u) \in [0, 1]$ satisfy the conditions $\sum_{t=1}^{\infty} \alpha_t(x, u) = \infty$ and $\sum_{t=1}^{\infty} \alpha_t(x, u)^2 < \infty$.

Algorithm 1 LRPG
 1: input: simulator, $T$, $\epsilon$
 2: initialise $\theta$, critic (if using), $\lambda$, $\{\beta^1_t, \beta^2_t, \eta\}$
 3: set $t = 0$, $x_t \sim \mu_0$
 4: while $t <$ max iterations do
 5:     perform $u_t \sim \pi_\theta(x_t)$
 6:     observe $r_t$, $x_{t+1}$, sample $y \sim T(\cdot \mid x_t)$
 7:     if $\hat{K}_1(\theta)$ has not converged then $\hat{k}_1 \leftarrow \hat{K}_1(\theta)$
 8:     update critic (if using)
 9:     update $\theta$ and $\lambda$ using equation (1), set $t \leftarrow t + 1$
10: output $\theta$

Now, the convergence of PB-LRL algorithms is guaranteed as long as the original policy gradient algorithm (such as PPO (Liu et al., 2019) or A2C (Konda & Tsitsiklis, 2000; Bhatnagar et al., 2009)) converges for each single objective (Skalse et al., 2022b). We can then combine Lemma 1 with these results to guarantee that Lexicographically Robust Policy Gradient (LRPG), Algorithm 1, converges to a policy that maximises robustness while remaining (approximately) optimal with respect to $R$.

Theorem 2. Consider a DOMDP as in Definition 1 and let $\pi_\theta$ be a parameterised policy. Take $K_1(\theta) = \mathbb{E}_{x_0 \sim \mu_0}[V^{\pi_\theta}(x_0)]$ to be computed through a chosen algorithm (e.g., A2C, PPO) that optimises $K_1(\theta)$, and let $K_2(\theta) = -K_T(\theta)$. Given an $\epsilon > 0$, if the iteration $\theta \leftarrow \mathrm{proj}_\Theta\big[\theta + \nabla_\theta \hat{K}_1\big]$ is guaranteed to converge to a parameter set $\theta^*$ that maximises $K_1$, and hence $J$ (locally or globally), then LRPG converges a.s. under PB-LRL conditions to parameters $\theta^\epsilon$ that satisfy

$$\theta^\epsilon \in \arg\min_{\theta \in \Theta'} K_T(\theta) \quad \text{such that} \quad K_1(\theta^\epsilon) \ge K_1^* - \epsilon,$$

where $K_1^* := K_1(\theta^*)$, and $\Theta' = \Theta$ if $\theta^*$ is globally optimal and a compact local neighbourhood of $\theta^*$ otherwise.

We reflect again on Figure 1.
The main idea behind LRPG is that by formally expanding the set of acceptable policies with respect to $K_1$, we may find robust policies more effectively while guaranteeing a minimum performance in terms of expected rewards.
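The robustness objective of this section can be sketched for a tabular softmax policy as follows. This is an illustrative instantiation under stated assumptions, not the implementation used in the experiments: the policy is stored as a table of logits, the batch estimate weights all sampled states equally, and the sampled update applies the softmax Jacobian to the difference between the action distributions at $x$ and at a sampled $y$. The names `k_t_hat` and `sampled_step` are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]


def k_t_hat(theta, T, states):
    """Batch estimate: mean over states of 0.5 * ||pi(x) - <pi, T>(x)||^2."""
    pi = [softmax(row) for row in theta]
    n, m = len(pi), len(pi[0])
    total = 0.0
    for x in states:
        noisy = [sum(T[x][y] * pi[y][u] for y in range(n)) for u in range(m)]
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(pi[x], noisy))
    return total / len(states)


def sampled_step(theta, T, x, lr, rng):
    """Descent step on theta[x] using one sample y ~ T(. | x)."""
    y = rng.choices(range(len(theta)), weights=T[x])[0]
    px, py = softmax(theta[x]), softmax(theta[y])
    diff = [a - b for a, b in zip(px, py)]
    inner = sum(d * p for d, p in zip(diff, px))
    # Softmax Jacobian applied to diff: grad[v] = px[v] * (diff[v] - inner).
    theta[x] = [t - lr * p * (d - inner)
                for t, p, d in zip(theta[x], px, diff)]
```

A usage pattern would be to call `sampled_step` on states visited during roll-outs and to monitor `k_t_hat` on a held-out batch; with a uniform kernel the estimate should shrink as the action distributions of different states approach each other.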

5. EXPERIMENTS

We verify the theoretical results of LRPG in a series of experiments on discrete state/action safety-related environments (Chevalier-Boisvert et al., 2018). Minigrid-LavaGap and Minigrid-LavaCrossing are safe exploration tasks where the agent needs to navigate an environment with cliff-like regions and receives a reward of 1 when it finds a target. Minigrid-DynamicObstacles is a dynamic obstacle-avoidance environment where the agent is penalised for hitting an obstacle, and gets a positive reward when finding a target. Minigrid-LavaGap is small enough to be fully observable, and the other two environments are partially observable. In all cases observations consist of a 7 × 7 field of view in front of the agent, with 3 channels encoding the color and state of objects in the environment. We use A2C (Sutton & Barto, 2018) and PPO (Schulman et al., 2017) for our implementations of LRPG, which we denote by LR-A2C and LR-PPO, respectively. In all cases, the lexicographic tolerance was set to $\epsilon = 0.99\,\hat{k}_1$ to deviate as little as possible from the primary objective.

Sampling $\tilde{T}$. To simulate the assumed noise kernel $\tilde{T}$ we disturb $x$ as $\tilde{x} = x + \xi$ for (1) a uniform bounded noise signal $\xi \sim U[-b, b]$ (kernel $\tilde{T}^u$) with $b = 2$ (1.5 for LavaCrossing) and (2) a Gaussian noise (kernel $\tilde{T}^g$) such that $\xi \sim N(0, 0.5)$. We test the resulting policies against a noiseless environment ($\emptyset$), a kernel $T_1 = \tilde{T}^u$ and a kernel $T_2 = \tilde{T}^g$. The main point of these combinations is to also test the policies when the true noise $T$ is similar to $\tilde{T}$.

General Robustness Results. Firstly, we investigate the robustness of four algorithms where we do not have a Q function. If we do not have an estimator for the critic $Q^\pi$, Proposition 1 suggests that minimising the distance between $\pi$ and $\langle \pi, T \rangle$ can serve as a proxy to minimise the robustness regret. We consider the algorithms:
1. Vanilla PPO (noiseless).
2. LR-PPO with a uniform noise kernel ($K^u_T$).
3. LR-PPO with a Gaussian noise kernel ($K^g_T$).
4. SA-PPO from Zhang et al. (2020).
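The additive disturbances described under "Sampling" above can be sketched as follows. This is a hedged illustration only: it assumes the observation is available as a flat list of floats, and it interprets the second parameter of $N(0, 0.5)$ as a standard deviation, which the text does not specify. The function name `disturb` and its arguments are hypothetical.

```python
import random


def disturb(obs, kind, rng, b=2.0, sigma=0.5):
    """Return obs + xi for additive observation noise xi.

    kind = "uniform":  xi ~ U[-b, b] per entry.
    kind = "gaussian": xi ~ N(0, sigma) per entry, with sigma treated as a
                       standard deviation (an assumption of this sketch).
    kind = "none":     noiseless copy of the observation.
    """
    if kind == "uniform":
        return [v + rng.uniform(-b, b) for v in obs]
    if kind == "gaussian":
        return [v + rng.gauss(0.0, sigma) for v in obs]
    if kind == "none":
        return list(obs)
    raise ValueError(f"unknown noise kind: {kind}")
```

For example, `disturb(obs, "uniform", random.Random(0), b=1.5)` would correspond to the bound quoted for LavaCrossing, under the assumptions above.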
In these experiments, we use PPO with neural policies and value functions; the architectures and hyper-parameters used in each case can be found in Appendix C. The results are summarised in the left-hand side of Table 1. Each entry is the median of 10 independent training processes, with reward values measured as the mean of 100 independent trajectories.

Robustness through Disadvantage Objectives. If we have an estimator for the critic $Q^\pi$ we can obtain robustness without inducing regularity in the policy using $D^\pi$, yielding a larger policy subspace to steer towards, and hopefully achieving policies closer to optimal. With the goal of diving deeper into the results of Theorem 1, we consider the objective

$$K_D(\theta) := \sum_{x \in X} \mu_{\pi_\theta}(x)\, \tfrac{1}{2} \big\| D^{\pi_\theta}(x, T) \big\|_2^2.$$

We aim to test the hypothesis introduced through this work: by setting $K_2 = K_D$ and thus aiming to minimise the disadvantage $D$, we may obtain policies that yield better robustness with similar expected rewards. Observe that $\pi_D \in \Pi_D \implies K_D(\pi_D) = 0$. To test this, we compare the following algorithms on the same environments:
1. Vanilla A2C (noiseless).
2. LR-A2C with $K^u_T$.
3. LR-A2C with $K^g_T$.
4. LR-A2C with $K_2 = K_D$.

We use A2C in this case since the structure of the original cost function is simpler than that of PPO, and hence easier to compare between the scenarios above, and we modified A2C to retain a Q function as a critic. With each objective function resulting in gradient descent steps that pull the policy towards different maximally robust sets ($K_T \to \Pi_T$ and $K_D \to \Pi_D$ respectively), we would expect to obtain increasing robustness for $K_D$. The results are presented in the right-hand side of Table 1.

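Table 1. Left: PPO-based algorithms. Right: A2C-based algorithms.

Environment | Noise | Vanilla | LR-PPO($K^u_T$) | LR-PPO($K^g_T$) | SA-PPO | Vanilla | LR-A2C($K^u_T$) | LR-A2C($K^g_T$) | LR-A2C($K_D$)
LavaGap | ∅ | 0.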

6. DISCUSSION

Experiments. We applied LRPG on PPO and A2C algorithms, for a set of discrete-action, discrete-state grid environments. These environments are particularly sensitive to robustness problems; the rewards are sparse, and applying a sub-optimal action at any step of the trajectory often leads to terminal states with zero (or negative) reward. LRPG successfully induces lower robustness regrets in the tested scenarios, and the use of $K_D$ as an objective (even though we did not prove the convergence of a gradient-based method with such an objective) yields a better compromise between robustness and rewards. When compared to recent observational robustness methods, LRPG obtains similar robustness results while preserving the original guarantees of the chosen algorithm (it even outperforms them in some cases, although this is probably highly problem-dependent, so we do not claim an improvement for every DOMDP).

Further Considerations on LRPG. The characterisation of robustness as a policy being invariant to a stochastic operator may be useful for other versions of robustness in RL. For example, in robustness against transition probability disturbances (or distributional shifts), one may consider distribution ambiguity sets and exploit distributionally robust optimisation ideas to investigate policy invariances. In this case, investigating the structure of maximally robust policies may yield a mechanism to design RL algorithms that are generally robust to model uncertainties.

Shortcomings

The motivation for LRPG comes partially from the situation where, when deploying a model-free controller in a complex dynamical system, we may not have a feasible way of estimating the noise generator. There is an alternative approach to robust RL (exploited in most of the literature), which consists of taking a disturbance structure (e.g. adversarial noise) and training directly to optimise the rewards in the disturbed MDP. Apart from LRPG preserving formal guarantees, there is no clear answer as to which approach is more rational, or more effective in general. The choice would depend on the problem at hand, the possible existence of an adversary, the requirement (or lack thereof) for formal guarantees, etc. We cannot claim that our approach is better in every way; we simply show through this work that it is a useful approach for learning policies in specific problems where, for example, we need to control a dynamical system in which the noise sources are unknown and we need to retain certain formal guarantees of the algorithms used.

Robustness, Complexity and Invariances. Sections 2 and 3 discuss at length the structure, shape and dependence of the maximally robust policy sets. These insights help derive optimisation objectives to use in LRPG, but there is more to be said about how policy robustness is affected by the underlying MDP properties. We hint at this in the proof of Corollary 1. More regular (less complex in entropy terms, or more symmetric) reward functions (e.g., reward functions with smaller variance across the actions $R(x, \cdot, y)$) seem to induce larger robust policy sets. In other words, for a fixed policy, a more complex reward function yields larger robustness regrets as soon as any noise is introduced in the system. This raises questions on how to use these principles to derive more robust policies in a comprehensive way, but we leave these questions for future work.
Additionally, one could extend these ideas to use LRL to obtain policies that generalise to a subclass of reward functions.

B.1 AUXILIARY RESULTS

Theorem 3 (Stochastic Approximation with Non-Expansive Operator). Let $\{\xi_t\}$ be a random sequence with $\xi_t \in \mathbb{R}^n$ defined by the iteration

$$\xi_{t+1} = \xi_t + \alpha_t \big(F(\xi_t) - \xi_t + M_{t+1}\big),$$

where:

1. The step sizes $\alpha_t$ satisfy Assumption 2.
2. $F : \mathbb{R}^n \to \mathbb{R}^n$ is a $\lVert \cdot \rVert_\infty$ non-expansive map. That is, for any $\xi_1, \xi_2 \in \mathbb{R}^n$, $\lVert F(\xi_1) - F(\xi_2) \rVert_\infty \le \lVert \xi_1 - \xi_2 \rVert_\infty$.
3. $\{M_t\}$ is a martingale difference sequence with respect to the increasing family of σ-fields $\mathcal{F}_t := \sigma(\xi_0, M_0, \xi_1, M_1, \ldots, \xi_t, M_t)$.

Then, the sequence $\xi_t \to \xi^*$ almost surely, where $\xi^*$ is a fixed point such that $F(\xi^*) = \xi^*$.

Proof. See Borkar & Soumyanatha (1997).

Theorem 4 (PB-LRL Convergence). Let $\mathcal{M}$ be a multi-objective MDP with objectives $K_i$, $i \in \{1, \ldots, m\}$, of the same form. Assume a policy $\pi$ is twice differentiable in parameters $\theta$, and if using a critic $V_i$ assume it is continuously differentiable in $w_i$. Suppose that if PB-LRL is run for $T$ steps, there exists some limit point $w_i^*(\theta)$ when $\theta$ is held fixed under conditions $C$ on $\mathcal{M}$, $\pi$ and $V_i$. If $\lim_{T \to \infty} \mathbb{E}_t[\theta] \in \Theta^{\epsilon}_1$ for $m = 1$, then for any $m \in \mathbb{N}$ we have $\lim_{T \to \infty} \mathbb{E}_t[\theta] \in \Theta^{\epsilon}_m$, where $\epsilon$ depends on the representational power of the parameterisations of $\pi$, $V_i$.

Proof Sketch. We refer the interested reader to Skalse et al. (2022b) for a full proof, and here attempt to provide the intuition behind the result in the form of a proof sketch. Let us begin by briefly recalling the general problem statement: we wish to take a multi-objective MDP $\mathcal{M}$ with $m$ objectives, and obtain a lexicographically optimal policy (one that optimises the first objective, and then subject to this optimises the second objective, and so on). More precisely, for a policy $\pi$ parameterised by $\theta$, we say that $\pi$ is (globally) lexicographically $\epsilon$-optimal if $\theta \in \Theta^{\epsilon}_m$, where $\Theta^{\epsilon}_0 = \Theta$ is the set of all policies in $\mathcal{M}$, $\Theta^{\epsilon}_{i+1} := \{\theta \in \Theta^{\epsilon}_i \mid \max_{\theta' \in \Theta^{\epsilon}_i} K_i(\theta') - K_i(\theta) \le \epsilon_i\}$, and $\mathbb{R}^{m-1} \ni \epsilon \succcurlyeq 0$.
The basic idea behind policy-based lexicographic reinforcement learning (PB-LRL) is to use a multi-timescale approach to first optimise $\theta$ using $K_1$, then at a slower timescale optimise $\theta$ using $K_2$ while adding the condition that the loss with respect to $K_1$ remains bounded by its current value, and so on. This sequence of constrained optimisation problems can be solved using a Lagrangian relaxation (Bertsekas, 1999), either in series or, via a judicious choice of learning rates, simultaneously, by exploiting a separation in timescales (Borkar, 2008). In the simultaneous case, the parameters of the critic $w_i$ for each objective are updated on the fastest timescale (if no actor-critic algorithm is used, this part of the argument may be safely ignored), then the parameters $\theta$, and finally (i.e., most slowly) the Lagrange multipliers for each of the remaining constraints.

The proof proceeds via induction on the number of objectives, using a standard stochastic approximation argument (Borkar, 2008). In particular, due to the learning rates chosen, we may consider the more slowly updated parameters fixed for the purposes of analysing the convergence of the more quickly updated parameters. In the base case where $m = 1$, we have (by assumption) that $\lim_{T \to \infty} \mathbb{E}_t[\theta] \in \Theta^{\epsilon}_1$. This is simply the standard (non-lexicographic) RL setting. Before continuing to the inductive step, Skalse et al. (2022b) observe that because gradient descent on $K_1$ converges to a globally optimal stationary point when $m = 1$, $K_1$ must be globally invex (the opposite implication is also true) (Ben-Israel & Mond, 1986a). This observation is useful because each of the objectives $K_i$ shares the same functional form, so they are all invex; furthermore, invexity is conserved under linear combinations and the addition of scalars, meaning that the Lagrangian formed in the relaxation of each constrained optimisation problem is also invex.
As a result, if we assume that $\lim_{T \to \infty} \mathbb{E}_t[\theta] \in \Theta^{\epsilon}_i$ as our inductive hypothesis, then the stationary point of the Lagrangian for optimising objective $K_{i+1}$ is a global optimum, given the constraints that it does not worsen performance on $K_1, \ldots, K_i$. Via Slater's condition (Slater, 1950) and standard saddle-point arguments (Bertsekas, 1999; Paternain et al., 2019), we therefore have that $\lim_{T \to \infty} \mathbb{E}_t[\theta] \in \Theta^{\epsilon}_{i+1}$, completing the inductive step, and thus the overall inductive argument. This concludes the proof that $\lim_{T \to \infty} \mathbb{E}_t[\theta] \in \Theta^{\epsilon}_m$. We refer the reader to Skalse et al. (2022b) for a discussion of the error $\epsilon$, but intuitively it corresponds to a combination of the representational power of $\theta$, the critic parameters $w_i$ (if used), and the duality gap due to the Lagrangian relaxation (Paternain et al., 2019). In cases where the representational power of the various parameters is sufficiently high, it can be shown that $\epsilon = 0$.

Lemma 1. Let $\pi_\theta$ be a fully-parameterised policy in a DOMDP, and $\alpha_t$ a learning rate satisfying Assumption 2. Consider the following approximated gradient for objective $K_T(\pi)$ and sampled point $x \in X$:

$$\nabla_\theta K_T(\theta)(x) = \big(\pi_\theta(x) - \pi_\theta(y)\big) \nabla_\theta \pi_\theta(x), \quad y \sim T(\cdot \mid x). \tag{9}$$

Then, the following iteration with $x \in X$ and some initial $\theta_0$,

$$\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta K_T(\theta_t),$$

yields $\theta_t \to \tilde{\theta}$ almost surely, where $\tilde{\theta}$ satisfies $K_T(\tilde{\theta}) = 0$.

Proof. See Appendix B.2.
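The iteration in Lemma 1 can be simulated directly for a tabular policy, where $\nabla_\theta \pi_\theta(x)$ is the identity and the update reduces to $\pi_{t+1}(x) = \pi_t(x) - \alpha_t(\pi_t(x) - \pi_t(y))$. The sketch below is only an illustration under assumptions that are not part of the lemma: states are sampled uniformly, the step size is $\alpha_t = (t+1)^{-0.6}$ (one schedule whose sum diverges and whose squared sum converges), and the kernel and initial policy are hypothetical.

```python
import numpy as np

def robustness_iteration(pi0, T, steps=20000, seed=0):
    # Tabular version of the update in Lemma 1:
    # pi(x) <- pi(x) - alpha_t * (pi(x) - pi(y)), with y ~ T(. | x).
    rng = np.random.default_rng(seed)
    pi = np.array(pi0, dtype=float)
    n_states = pi.shape[0]
    for t in range(steps):
        alpha = (t + 1.0) ** -0.6          # assumed step-size schedule
        x = rng.integers(n_states)         # assumed: states sampled uniformly
        y = rng.choice(n_states, p=T[x])   # noisy observation of state x
        pi[x] -= alpha * (pi[x] - pi[y])
    return pi

def fixed_point_residual(pi, T):
    # Sup-norm distance between pi and <pi, T> = T pi; zero iff pi is a fixed point.
    return float(np.max(np.abs(pi - T @ pi)))

# Hypothetical 3-state, 2-action example.
T = np.array([[0.50, 0.30, 0.20],
              [0.25, 0.50, 0.25],
              [0.20, 0.30, 0.50]])
pi0 = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

pi_final = robustness_iteration(pi0, T)
print(fixed_point_residual(pi0, T), fixed_point_residual(pi_final, T))
```

Each update is a convex combination of two rows, so every row remains a valid probability distribution. The residual $\lVert \pi - \langle \pi, T \rangle \rVert_\infty$ printed at the end is a simple proxy for how far the policy is from the fixed-point set.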

B.2 PROOFS

We now present the proofs for the statements made throughout the work.

Proposition 1. If a policy $\pi \in \Pi$ is a fixed point of the operator $\langle \cdot, T \rangle$, then it holds that $\langle \pi, T \rangle = \pi$. Therefore, one can compute the robustness of the policy $\pi$ to obtain $\rho(\pi, T) = J(\pi) - J(\langle \pi, T \rangle) = J(\pi) - J(\pi) = 0 \implies \pi \in \Pi_0$. Therefore, $\Pi_T \subseteq \Pi_0$. For discrete state and action spaces, the space of stochastic kernels $K : X \to \Delta(X)$ is equivalent to the space of row-stochastic $|X| \times |X|$ matrices; therefore one can write $T(y \mid x) \equiv T_{xy}$ as the $xy$-th entry of the matrix $T$. Then, the representation of a constant policy as an $|X| \times |U|$ matrix can be written as $\bar{\pi} = \mathbf{1}_{|X|} v^\top$, where $\mathbf{1}_{|X|}$ is the all-ones column vector of dimension $|X|$ and $v \in \Delta(U)$ is any probability distribution over the action space. Observe that applying the operator $\langle \cdot, T \rangle$ to a constant policy yields:

$$\langle \bar{\pi}, T \rangle = T \mathbf{1}_{|X|} v^\top. \tag{11}$$

By the Perron-Frobenius Theorem (Horn & Johnson, 2012), since $T$ is row-stochastic it has at least one eigenvalue $\mathrm{eig}(T) = 1$, and this admits a (strictly positive) eigenvector $T \mathbf{1}_{|X|} = \mathbf{1}_{|X|}$. Therefore, substituting this in equation 11:

$$\langle \bar{\pi}, T \rangle = T \mathbf{1}_{|X|} v^\top = \mathbf{1}_{|X|} v^\top = \bar{\pi} \implies \bar{\Pi} \subseteq \Pi_T,$$

where $\bar{\Pi}$ denotes the set of constant policies.

Proposition 2. Recall the definition in equation 2 and that the noise disadvantage function of a policy $\pi$ is given by equation 4. We want to show that $D^{\pi}(x, T) = 0 \implies \rho(\pi, T) = 0$. Taking $D^{\pi}(x, T) = 0$, one has a policy that produces a disadvantage of zero when the noise kernel $T$ is applied. Then,

$$D^{\pi}(x, T) = 0 \implies \mathbb{E}_{u \sim \langle \pi, T \rangle(x)}[Q^{\pi}(x, u)] = V^{\pi}(x) \quad \forall x \in X. \tag{12}$$

Now define the value of the disturbed policy

$$V^{\langle \pi, T \rangle}(x_0) := \mathbb{E}_{u_k \sim \langle \pi, T \rangle(x_k),\, x_{k+1} \sim P(\cdot \mid x_k, u_k)} \Big[ \sum_{k=0}^{\infty} \gamma^k r(x_k, u_k) \Big],$$

and take:

$$V^{\langle \pi, T \rangle}(x) = \mathbb{E}_{u \sim \langle \pi, T \rangle(x),\, y \sim P(\cdot \mid x, u)} \big[ r(x, u, y) + \gamma V^{\langle \pi, T \rangle}(y) \big].$$

We will now show that $V^{\pi}(x) = V^{\langle \pi, T \rangle}(x)$ for all $x \in X$.
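Observe, from equation 12 using $V^{\pi}(x) = \mathbb{E}_{u \sim \langle \pi, T \rangle(x)}[Q^{\pi}(x, u)]$, we have $\forall x \in X$:

$$\begin{aligned}
V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) &= \mathbb{E}_{u \sim \langle \pi, T \rangle(x)}[Q^{\pi}(x, u)] - \mathbb{E}_{u \sim \langle \pi, T \rangle(x),\, y \sim P(\cdot \mid x, u)} \big[ r(x, u, y) + \gamma V^{\langle \pi, T \rangle}(y) \big] \\
&= \mathbb{E}_{u \sim \langle \pi, T \rangle(x),\, y \sim P(\cdot \mid x, u)} \big[ r(x, u, y) + \gamma V^{\pi}(y) - r(x, u, y) - \gamma V^{\langle \pi, T \rangle}(y) \big] \\
&= \gamma\, \mathbb{E}_{u \sim \langle \pi, T \rangle(x),\, y \sim P(\cdot \mid x, u)} \big[ V^{\pi}(y) - V^{\langle \pi, T \rangle}(y) \big].
\end{aligned} \tag{13}$$

Now, taking the sup norm on both sides of equation 13 we get

$$\lVert V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) \rVert_\infty = \gamma\, \big\lVert \mathbb{E}_{y \sim P(\cdot \mid x, u)} \big[ V^{\pi}(y) - V^{\langle \pi, T \rangle}(y) \big] \big\rVert_\infty. \tag{14}$$

Observe that for the right-hand side of equation 14, we have

$$\big\lVert \mathbb{E}_{y \sim P(\cdot \mid x, u)} \big[ V^{\pi}(y) - V^{\langle \pi, T \rangle}(y) \big] \big\rVert_\infty \le \lVert V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) \rVert_\infty.$$

Therefore, since $\gamma < 1$,

$$\lVert V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) \rVert_\infty \le \gamma \lVert V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) \rVert_\infty \implies \lVert V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) \rVert_\infty = 0. \tag{15}$$

Finally, $\lVert V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) \rVert_\infty = 0 \implies V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) = 0 \ \forall x \in X$, and $V^{\pi}(x) - V^{\langle \pi, T \rangle}(x) = 0 \ \forall x \in X \implies J(\pi) = J(\langle \pi, T \rangle) \implies \rho(\pi, T) = 0$.

Inclusion Theorem 1. Combining Proposition 1 and Proposition 2, we simply need to show that $\Pi_T \subseteq \Pi_D$. Take $\pi$ to be a fixed point of $\langle \cdot, T \rangle$. Then $\langle \pi, T \rangle = \pi$, and from the definition in equation 4:

$$\begin{aligned}
D^{\pi}(x, T) &= V^{\pi}(x) - \mathbb{E}_{u \sim \langle \pi, T \rangle(x, \cdot)}[Q^{\pi}(x, u)] \\
&= V^{\pi}(x) - \mathbb{E}_{u \sim \pi(x, \cdot)}[Q^{\pi}(x, u)] \\
&= V^{\pi}(x) - V^{\pi}(x) = 0.
\end{aligned}$$

Therefore, $\pi \in \Pi_D$, which completes the sequence of inclusions. To show convexity of the set of constant policies $\bar{\Pi}$ and of $\Pi_T$, first recall that a constant policy $\bar{\pi} \in \bar{\Pi}$ can be written as $\bar{\pi} = \mathbf{1} v^\top$, where $v \in \Delta(U)$ is any probability distribution over the action space. Now take $\bar{\pi}_1, \bar{\pi}_2 \in \bar{\Pi}$. For any $\alpha \in [0, 1]$,

$$\alpha \bar{\pi}_1 + (1 - \alpha) \bar{\pi}_2 = \alpha \mathbf{1} v_1^\top + (1 - \alpha) \mathbf{1} v_2^\top = \mathbf{1} (\alpha v_1 + (1 - \alpha) v_2)^\top \in \bar{\Pi}.$$

Finally, for the set $\Pi_T$, assume there exist two different policies $\pi_1, \pi_2$, both fixed points of $\langle \cdot, T \rangle$. Then, for any $\alpha \in [0, 1]$,

$$\langle \alpha \pi_1 + (1 - \alpha) \pi_2, T \rangle = \alpha T \pi_1 + (1 - \alpha) T \pi_2 = \alpha \pi_1 + (1 - \alpha) \pi_2.$$

Therefore, any convex combination of fixed points is also a fixed point.

Corollary 1. For statement (i), let $R(\cdot, \cdot, \cdot) = c$ for some constant $c \in \mathbb{R}$. Then, $J(\pi) = \mathbb{E}_{x_0 \sim \mu_0} [\sum_t \gamma^t r_t \mid \pi] = \frac{c \gamma}{1 - \gamma}$, which does not depend on the policy $\pi$.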
For any noise kernel $T$ and policy $\pi$, $J(\pi) - J(\langle \pi, T \rangle) = 0 \implies \pi \in \Pi_0$. For statement (ii), assume $\exists \pi \in \Pi_0 : \pi \notin \Pi_T$. Then, $\exists x^* \in X$ and $u^* \in U$ such that $\pi(x^*, u^*) \neq \langle \pi, T \rangle(x^*, u^*)$. Let:

$$R(x, u, x') := \begin{cases} c & \text{if } x = x^* \text{ and } u = u^*, \\ 0 & \text{otherwise.} \end{cases}$$

Then, for $c \neq 0$, $\mathbb{E}[R(x, \pi(x), x')] \neq \mathbb{E}[R(x, \langle \pi, T \rangle(x), x')]$, and since the MDP is ergodic, $x^*$ is visited infinitely often and $J(\pi) - J(\langle \pi, T \rangle) \neq 0 \implies \pi \notin \Pi_0$, which contradicts the assumption. Therefore, $\Pi_0 \setminus \Pi_T = \emptyset \implies \Pi_0 = \Pi_T$.

Lemma 1. We make use of standard results on stochastic approximation with non-expansive operators (specifically, Theorem 3 in the appendix; Borkar & Soumyanatha, 1997). First, observe that for a fully parameterised policy, one can assume a tabular representation such that $\pi_\theta(x, u) = \theta_{xu}$, and $\nabla_\theta \pi_\theta(x) \equiv \mathrm{Id}$. We can then write the stochastic gradient descent problem in terms of the policy. Let $y \sim T(\cdot \mid x)$. Then:

$$\pi_{t+1}(x) = \pi_t(x) - \alpha_t \big( \pi_t(x) - \pi_t(y) \big) = \pi_t(x) - \alpha_t \Big( \pi_t(x) - \langle \pi_t, T \rangle(x) - \big( \pi_t(y) - \langle \pi_t, T \rangle(x) \big) \Big).$$

We now need to verify that the necessary conditions for applying Theorem 3 hold. First, $\alpha_t$ satisfies Assumption 2. Second, making use of the property $\lVert T \rVert_\infty = 1$ for any row-stochastic matrix $T$, for any two policies $\pi_1, \pi_2 \in \Pi$:

$$\lVert \langle \pi_1, T \rangle - \langle \pi_2, T \rangle \rVert_\infty = \lVert T \pi_1 - T \pi_2 \rVert_\infty = \lVert T (\pi_1 - \pi_2) \rVert_\infty \le \lVert T \rVert_\infty \lVert \pi_1 - \pi_2 \rVert_\infty = \lVert \pi_1 - \pi_2 \rVert_\infty.$$

Therefore, the operator $\langle \cdot, T \rangle$ is non-expansive with respect to the sup-norm. For the final condition, we have

$$\mathbb{E}_{y \sim T(\cdot \mid x)} \big[ \pi_t(y) - \langle \pi_t, T \rangle(x) \mid \pi_t, T \big] = \sum_{y \in X} T(y \mid x) \pi_t(y) - \langle \pi_t, T \rangle(x) = 0.$$

Therefore, the difference $\pi_t(y) - \langle \pi_t, T \rangle(x)$ is a martingale difference for all $x$. One can then apply Theorem 3 with $\xi_t(x) \equiv \pi_t(x)$, $F(\cdot) \equiv \langle \cdot, T \rangle$ and $M_{t+1} \equiv \pi_t(y) - \langle \pi_t, T \rangle(x)$ to conclude that $\pi_t(x) \to \tilde{\pi}(x)$ almost surely. Finally, from Assumption 1, for any policy all states $x \in X$ are visited infinitely often; therefore $\pi_t(x) \to \tilde{\pi}(x) \ \forall x \in X \implies \pi_t \to \tilde{\pi}$, where $\tilde{\pi}$ satisfies $\langle \tilde{\pi}, T \rangle = \tilde{\pi}$ and $K_T(\tilde{\pi}) = 0$.

Theorem 2.
We apply the results from Skalse et al. (2022b) in Theorem 4. Essentially, Skalse et al. (2022b) prove that for a policy gradient algorithm to lexicographically optimise a policy for multiple objectives, it is a sufficient condition that the stochastic gradient descent algorithm finds optimal parameters for each of the objectives independently. From Lemma 1 we know that a policy gradient algorithm using the gradient estimate in equation 9 converges to a maximally robust policy, i.e. a set of parameters $\theta' = \arg\min_\theta K_T$. Additionally, by assumption, the chosen algorithm for $K_1$ converges to an optimal point $\theta^*$. While the two objective functions are not of the same form, as they are in Skalse et al. (2022b), the fact that they are both invex (Ben-Israel & Mond, 1986b), either locally or globally depending on the form of $K_1$, implies that $K$ is also invex and hence that the stationary point $\theta^{\epsilon}$ computed by LRPG satisfies equation 6.
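The chain of arguments in Proposition 1, Proposition 2 and Theorem 1 can be checked numerically on a small tabular example. The sketch below is illustrative only: the MDP, noise kernel and policies are randomly generated (hypothetical), $J(\pi)$ is taken as $\mu_0^\top V^{\pi}$ with standard discounting, and the disturbed policy is evaluated as the matrix product $T\pi$. It verifies that a constant policy $\bar{\pi} = \mathbf{1} v^\top$ is a fixed point of $\langle \cdot, T \rangle$, has zero disadvantage and zero robustness regret.

```python
import numpy as np

def evaluate(pi, P, R, gamma):
    # Exact policy evaluation. P[x, u, y] = P(y | x, u), R[x, u, y] = r(x, u, y).
    n_states = P.shape[0]
    r_xu = np.sum(P * R, axis=2)                  # expected reward per (x, u)
    P_pi = np.einsum("xu,xuy->xy", pi, P)         # state transitions under pi
    r_pi = np.sum(pi * r_xu, axis=1)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = r_xu + gamma * (P @ V)
    return V, Q

def robustness_regret(pi, T, P, R, gamma, mu0):
    # rho(pi, T) = J(pi) - J(<pi, T>), with J(pi) = mu0 . V^pi (assumed convention).
    V, _ = evaluate(pi, P, R, gamma)
    V_noisy, _ = evaluate(T @ pi, P, R, gamma)
    return float(mu0 @ (V - V_noisy))

def disadvantage(pi, T, P, R, gamma):
    # D(x, T) = V(x) - E_{u ~ <pi,T>(x)}[Q(x, u)].
    V, Q = evaluate(pi, P, R, gamma)
    return V - np.sum((T @ pi) * Q, axis=1)

def random_stochastic(rng, *shape):
    # Random array whose last axis sums to one.
    M = rng.random(shape)
    return M / M.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = random_stochastic(rng, n_states, n_actions, n_states)
R = rng.normal(size=(n_states, n_actions, n_states))
T = random_stochastic(rng, n_states, n_states)
mu0 = np.full(n_states, 1.0 / n_states)

v = random_stochastic(rng, n_actions)
pi_const = np.ones((n_states, 1)) * v             # constant policy 1 v^T
pi_other = random_stochastic(rng, n_states, n_actions)

print(np.allclose(T @ pi_const, pi_const))        # constant policies are fixed points
print(robustness_regret(pi_const, T, P, R, gamma, mu0))
print(robustness_regret(pi_other, T, P, R, gamma, mu0))
```

For the non-constant policy the regret is generally non-zero; its sign and size depend on the randomly drawn MDP, so no particular value should be read into it.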

B.3 ON ADVERSARIAL DISTURBANCES AND OTHER NOISE KERNELS

A problem that remains open after this work is what constitutes an appropriate choice of the assumed kernel $\tilde{T}$, and what we can expect from restricting to a particular class of $\tilde{T}$. We first discuss adversarial examples, and then general considerations on $\tilde{T}$ versus the true kernel $T$.

Adversarial Noise. As mentioned in the introduction, much of the previous work focuses on adversarial disturbances. We did not directly address this in the results of this work since our motivation lies in scenarios where the disturbance is not adversarial and is unknown. However, following the results of Section 3, we are able to reason about adversarial disturbances. Consider an adversarial map $T_{adv}$ defined by

$$\langle \pi, T_{adv} \rangle(x) = \pi(y), \quad y \in \arg\max_{y \in X_{ad}(x)} d\big(\pi(x), \pi(y)\big),$$

with $X_{ad}(x) \subseteq X$ being a set of admissible disturbance states for $x$, and $d(\cdot, \cdot)$ a distance measure between distributions (e.g. the 2-norm).

Proposition 3. Constant policies are fixed points of $T_{adv}$, and are the only fixed points if for all pairs $x_0, x_k$ there exists a sequence $\{x_0, \ldots, x_k\} \subseteq X$ such that $x_{i+1} \in X_{ad}(x_i)$.

Proof. First, it is straightforward that $\pi \in \bar{\Pi} \implies \langle \pi, T_{adv} \rangle(x) = \pi(x)$, where $\bar{\Pi}$ is the set of constant policies. To show they are the only fixed points, assume that there is a non-constant policy $\pi'$ that is a fixed point of $T_{adv}$. Then, there exist $x, z$ such that $\pi'(x) \neq \pi'(z)$. However, by assumption, we can construct a sequence $\{x, \ldots, z\} \subseteq X$ that connects $x$ and $z$, where every state in the sequence is in the admissible set of the previous one. Assume without loss of generality that this sequence is $\{x, y, z\}$. Then, if $\pi'$ is a fixed point, $\langle \pi', T_{adv} \rangle(x) = \pi'(x)$, $\langle \pi', T_{adv} \rangle(y) = \pi'(y)$ and $\langle \pi', T_{adv} \rangle(z) = \pi'(z)$. However, $\pi'(x) \neq \pi'(z)$, so either $\pi'(x) \neq \pi'(y) \implies d(\pi'(x), \pi'(y)) \neq 0$ or $\pi'(y) \neq \pi'(z) \implies d(\pi'(y), \pi'(z)) \neq 0$; therefore $\pi'$ cannot be a fixed point of $T_{adv}$.
The main difference between an adversarial operator and the random noise considered throughout this work is that $T_{adv}$ is not a linear operator and, additionally, it is time-varying (since the policy is being modified at every time step of the PG algorithm). Therefore, including it as an LRPG objective would invalidate the assumptions required for LRPG to retain the formal guarantees of the original PG algorithm used, and it is not guaranteed that the resulting policy gradient algorithm would converge.

Assumption of Noise Kernel. A question emerging from Section 4 is how to choose $\tilde{T}$, and how the choice influences the resulting policy robustness towards any other true $T$. In general, for an arbitrary policy landscape with respect to utility in a given MDP, there is no way of bounding the distance between the policies resulting from two different noise kernels $\tilde{T}_1, \tilde{T}_2$. As a counter-example, consider an MDP where there are two possible optimal policies $\pi^*_1, \pi^*_2$, and take these two policies to be maximally different, i.e. $\lVert \pi^*_1(x) - \pi^*_2(x) \rVert_\infty = 1 \ \forall x \in X$. Then, when using LRPG to obtain a robust policy, a slight deviation in the choice of $\tilde{T}$ can cause the gradient descent scheme to switch from converging to $\pi^*_1$ to converging to $\pi^*_2$, yielding in principle a completely different policy. However, what remains bounded is the optimality of the policy: through the LRPG guarantees we know that, in both cases, the resulting policy will be at most $\epsilon$ far from the optimal sum of rewards.

We can, however, state the following. Take $T$ to be any arbitrary noise kernel, and $\tilde{T}$ a kernel that satisfies $\Pi_{\tilde{T}} \equiv \bar{\Pi}$, where $\bar{\Pi}$ is the set of constant policies. Let $\pi$ be a policy resulting from an LRPG algorithm. Assume, for a distance metric $d$, that $\min_{\pi' \in \Pi_{\tilde{T}}} d(\pi, \pi') \le a$ for some $a < 1$. Then, since $\bar{\Pi} \subseteq \Pi_T$, it holds for any $T$ that $\min_{\pi' \in \Pi_T} d(\pi, \pi') \le a$. That is, the resulting policy is at most $a$ far from the set of fixed points (and therefore from a maximally robust policy) with respect to the true $T$. This is the key argument behind our choices for $\tilde{T}$: a priori, the most sensible choice is a kernel that has no other fixed point than the set of constant policies.
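The adversarial map $T_{adv}$ discussed above can be illustrated with a small tabular sketch. Everything concrete here is an assumption for illustration: the admissible sets are given as lists of state indices (hypothetical), the distance $d$ is the 2-norm between action distributions, and ties in the arg max are broken by taking the first maximiser.

```python
import numpy as np

def adversarial_policy(pi, admissible):
    # <pi, T_adv>(x) = pi(y*), where y* maximises ||pi(x) - pi(y)||_2 over X_ad(x).
    pi = np.asarray(pi, dtype=float)
    out = np.empty_like(pi)
    for x, candidates in enumerate(admissible):
        dists = [np.linalg.norm(pi[x] - pi[y]) for y in candidates]
        out[x] = pi[candidates[int(np.argmax(dists))]]
    return out

# Hypothetical chain of admissible sets connecting all three states.
admissible = [[0, 1], [0, 1, 2], [1, 2]]

pi_const = np.array([[0.3, 0.7]] * 3)                   # constant policy
pi_other = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

print(np.allclose(adversarial_policy(pi_const, admissible), pi_const))
print(np.allclose(adversarial_policy(pi_other, admissible), pi_other))
```

The constant policy is left unchanged, while the non-constant policy is altered at the states whose admissible set contains a state with a different action distribution, in line with Proposition 3. Note that the map depends on the policy itself through the arg max, which is the source of the non-linearity mentioned above.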

C EXPERIMENTS METHODOLOGY

In the experiments we use well-tested implementations of A2C and PPO adapted from Zhang (2018) to include the computation of the lexicographic parameters in equation 1. Since all the environments use a pixel representation of the observation, we use a shared representation for the value function and policy, where the first component is a convolutional network, implemented as in Zhang (2018).

For the implementation of the LRPG versions of the algorithms, in all cases we allow the algorithm to iterate for 1/3 of the total steps before starting to compute the robustness objectives. In other words, we use $K(\theta) = K_1(\theta)$ until $t = \frac{1}{3}$ of the maximum number of steps, and from this point we resume the lexicographic robustness computation as described in Algorithm 1. This is due to the structure of the environments simulated. The rewards (and in particular the positive rewards) are very sparse in the environments considered. Therefore, when computing the policy gradient steps, the loss for the primary objective is practically zero until the environment is successfully solved at least once. If we implemented the combined lexicographic loss from the first time step, the algorithm would often converge to a (constant) policy without exploring for enough steps, leading to convergence towards a maximally robust policy that does not solve the environment.

Noise Kernels. We consider two types of noise: a normally distributed noise $T^g$ and a uniformly distributed noise $T^u$. For the environments LavaGap and DynamicObstacles, the kernel $T^u$ produces a disturbed state $\tilde{x} = x + \xi$ where $\lVert \xi \rVert_\infty \le 2$, and for LavaCrossing $\lVert \xi \rVert_\infty \le 1.5$. The normally distributed noise is in all cases $\mathcal{N}(0, 0.5)$. The maximum norm of the noise is quite large, but this is due to the structure of the observations in these environments. The pixel values are encoded as integers 0-9, where each integer represents a different feature in the environment (empty space, doors, lava, obstacle, goal...).
Therefore, any noise $\lVert \xi \rVert_\infty \le 0.5$ would most likely not be enough to confuse the agent. On the other hand, noise signals that are too large are unrealistic and produce pathological environments. All the policies are then tested against two "true" noise kernels, $T_1 = T^u$ and $T_2 = T^g$. The main reason for this is to test both the scenario where we assume a wrong noise kernel and the case where we train the agents with the correct kernel.

LRPG Parameters. The LRL parameters are initialised in all cases as $\beta^1_0 = 2$, $\beta^2_0 = 1$, $\lambda = 0$ and $\eta = 0.001$. The LRL tolerance is set to $\epsilon_t = 0.99\, k_1$ to ensure we never deviate too much from the original objective, since the environments have very sparse rewards. We use a first order approximation to compute the LRL weights from the original LMORL implementation.

Comparison with SA-PPO. One of the baselines included is the State-Adversarial PPO algorithm proposed in Zhang et al. (2020). The implementation includes an extra parameter that multiplies the regularisation objective, $k_{ppo}$. Since we were not able to find indications on the best parameter for discrete action environments, we implemented $k_{ppo} \in \{0.1, 1, 2\}$ and picked the best result for each entry in Table 1. Larger values seemed to de-stabilise the learning in some cases. The rest of the parameters are kept as in the vanilla PPO implementation.

C.1 EXTENDED RESULTS: ADVERSARIAL DISTURBANCES

Even though we do not use an adversarial attacker or disturbance in our reasoning through this work, we implemented a policy-based state-adversarial noise disturbance to test the benchmark algorithms against, and to evaluate how well each of the methods reacts to such adversarial disturbances.

Adversarial Disturbance. We implement a bounded policy-based adversarial attack, where at each state $x$ we maximise the KL divergence between the policy at the undisturbed and disturbed states, such that the adversarial operator is:

$$T^{\varepsilon}_{adv}(y \mid x) = 1 \implies y \in \arg\max_{\tilde{x}} D_{KL}\big(\pi(x), \pi(\tilde{x})\big) \quad \text{s.t.} \quad \lVert x - \tilde{x} \rVert_2 \le \varepsilon.$$
The optimisation problem is solved at every point by using a Stochastic Gradient Langevin Dynamics (SGLD) optimiser. This type of adversarial attack with an SGLD optimiser was proposed in Zhang et al. (2020). The results are presented in Table 5. As one can see, the adversarial disturbance is quite successful at severely lowering the obtained rewards in all scenarios. Additionally, as expected, SA-PPO was the most effective at minimising the disturbance effect (as it is trained with adversarial disturbances), although LRPG produces reasonably robust policies against this type of disturbance as well. Finally, A2C appears to be much more sensitive to adversarial disturbances than PPO, indicating that the policies produced by PPO are by default more robust than those produced by A2C.
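For reference, the two random noise kernels described under "Noise Kernels" above can be sketched as observation perturbations. This is a minimal sketch under assumptions that the text does not settle: the noise is drawn independently per observation entry, the value 0.5 in $\mathcal{N}(0, 0.5)$ is treated as a standard deviation, and no clipping or rounding back to the integer pixel encoding is applied. The function names and the example observation are hypothetical.

```python
import numpy as np

def uniform_noise(obs, bound, rng):
    # x~ = x + xi with each entry of xi uniform in [-bound, bound], so ||xi||_inf <= bound.
    return obs + rng.uniform(-bound, bound, size=obs.shape)

def gaussian_noise(obs, std, rng):
    # x~ = x + xi with each entry of xi ~ N(0, std^2) (std = 0.5 is an assumed reading).
    return obs + rng.normal(0.0, std, size=obs.shape)

rng = np.random.default_rng(0)
obs = rng.integers(0, 10, size=(7, 7))        # grid observation with integer codes 0-9

obs_u = uniform_noise(obs, bound=2.0, rng=rng)
obs_g = gaussian_noise(obs, std=0.5, rng=rng)
print(np.max(np.abs(obs_u - obs)) <= 2.0)     # the uniform kernel respects its bound
```

Because neighbouring feature codes differ by one, a bound of 2 lets the uniform kernel move a pixel value across more than one feature code, which is consistent with the remark that smaller noise would most likely not confuse the agent.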



We claim novelty on the application of such concepts to the understanding and improvement of robustness in disturbed observation RL. Although we have not found our results in previous work, there are strong connections between Sections 2-3 in this paper and the literature on planning for POMDPs (Spaan & Vlassis, 2004; Spaan, 2012) and MDP invariances (Ng et al., 1999; van der Pol et al., 2020; Skalse et al., 2022a).

The advantage of LRL is that we need not know in advance how to define "reasonably well" for each new task. Additionally, robustness through LRL provides us with a hyper-parameter that directly controls the trade-off between robustness and optimality: the optimality tolerance $\epsilon$. By selecting values of $\epsilon$ we determine how far we allow our resulting policy to be from an optimal policy in favour of it being more robust.

Definition 1 is a generalised form of the State-Adversarial MDP used by Zhang et al. (2020): the adversarial case is a particular form of DOMDP where $T$ is a probability measure that assigns probability 1 to one state.

There is a (natural) bijection between the set of constant policies and the space $\Delta(U)$. The set of fixed points of the operator $\langle \cdot, T \rangle$ also has an algebraic characterisation in terms of the null space of the operator $\mathrm{Id}(\cdot) - \langle \cdot, T \rangle$. We are not exploiting the latter characterisation in this paper.

The above inclusions are equalities for some MDPs. See Appendix A for examples.

The proof in Skalse et al. (2022b) also considers local lexicographic optima, though for the sake of simplicity, we do not do so here.

A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is (globally) invex if and only if there exists a function $g : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ such that $f(x_1) - f(x_2) \ge g(x_1, x_2)^\top \nabla f(x_2)$ for all $x_1, x_2 \in \mathbb{R}^n$ (Hanson, 1981).



Figure 2: Screenshots of the environments used.

Reward values gained by LRPG and baselines.

The hyper-parameters of the neural representations are presented in Table 2.

Shared Observation Layers

The actor and critic layers, for both algorithms, are a fully connected layer with 256 features as input and the corresponding output. We used in all cases an Adam optimiser. We optimised the parameters for each (vanilla) algorithm through a quick parameter search, and applied the same parameters to the Lexicographically Robust versions.

PPO Parameters

Extended Reward Results.

26±0.269 | 0.79±0.157 | 0.68±0.144 | 0.84±0.150 | 0.19±0.284 | 0.35±0.197 | 0.23±0.370 | 0.10±0.379
$T^2_{adv}$ | -0.49±0.312 | 0.51±0.234 | 0.33±0.202 | 0.55±0.170 | -0.54±0.209 | -0.21±0.192 | -0.53±0.261 | -0.51±0.260

A EXAMPLES AND FURTHER CONSIDERATIONS

We provide here two examples to show how we can obtain the limit scenarios $\Pi_0 = \Pi$ (any policy is maximally robust) or $\Pi_0 = \Pi_T$ (Example 1), and how for some MDPs the third inclusion in Theorem 1 is strict (Example 2).

Example 1. Consider the simple MDP in Figure 3. First, consider the reward function [...]. This produces a "dummy" MDP where all policies have the same reward sum. Then, $\forall T, \pi$, $V^{\langle \pi, T \rangle} = V^{\pi}$, and therefore we have [...]. In the example DOMDP (assuming the initial state is drawn uniformly from $X_0 = \{x_1, x_2\}$) one can show that at any time in the trajectory, there is a stationary probability $\Pr\{x_t = x_1\} = \frac{1}{2}$. Let us abuse notation and write $\pi($[...]. For the given reward structure we have $R(x_2) = (0\ 0)^\top$, and therefore: [...]. Since the transitions of the MDP are independent of the actions, following the same principle as in equation 7:

$$J(\langle \pi, T \rangle) = \frac{1}{2} \big\langle R(x_1), \langle \cdot, T \rangle(\pi)(x_1) \big\rangle \frac{\gamma}{1 - \gamma}.$$

For any noise map $\langle \cdot, T \rangle \neq \mathrm{Id}$, for the two-state policy it holds that $\pi \notin \Pi_T \implies \langle \pi, T \rangle \neq \pi$. Therefore $\langle \pi, T \rangle(x_1) \neq \pi(x_1)$ and: [...].

Example 2. Consider the same MDP in Figure 3 with reward function [...], and a reward of zero for all other transitions. Take a policy $\pi(x_1) = (1\ 0)$, $\pi(x_2) = (0\ 1)$. The policy yields a reward of 10 in state $x_1$ and a reward of 0 in state $x_2$. Again we assume the initial state is drawn uniformly from $X_0 = \{x_1, x_2\}$. Then, observe: [...]. Observe this noise map yields a policy with non-zero disadvantage, $D^{\pi}(x_1, T) = \frac{5\gamma}{1 - \gamma} - \left( \frac{5\gamma}{1 - \gamma} - 2.5 \right) = 2.5$ and similarly $D^{\pi}(x_2, T) = -2.5$, therefore $\pi \notin \Pi_D$. However, the policy is maximally robust: [...] (8). Therefore, $\pi \in \Pi_0$.

