OBSERVATIONAL ROBUSTNESS AND INVARIANCES IN REINFORCEMENT LEARNING VIA LEXICOGRAPHIC OBJECTIVES

Abstract

Policy robustness in Reinforcement Learning (RL) may not be desirable at any price: the alterations that robustness requirements induce in otherwise optimal policies should be explainable and quantifiable. Policy gradient algorithms with strong convergence guarantees are usually modified to obtain robust policies in ways that do not preserve those guarantees, which defeats the purpose of formal robustness requirements. In this work we study a notion of robustness in partially observable MDPs where state observations are perturbed by a noise-induced stochastic kernel. We characterise the set of policies that are maximally robust by analysing how policies are altered by this kernel. We then establish a connection between such robust policies and certain properties of the noise kernel, as well as structural properties of the underlying MDP, yielding sufficient conditions for policy robustness. We use these notions to propose a robustness-inducing scheme, applicable to any policy gradient algorithm, that formally trades off the reward achieved by a policy against its robustness level through lexicographic optimisation, while preserving the convergence properties of the original algorithm. We test the proposed approach on safety-critical RL environments, and show that it achieves high robustness under observational noise.

1. INTRODUCTION

Robustness in Reinforcement Learning (RL) (Morimoto & Doya, 2005) can be looked at from different perspectives: (1) distributional shifts in the training data with respect to the deployment stage (Satia & Lave Jr, 1973; Heger, 1994; Nilim & El Ghaoui, 2005; Xu & Mannor, 2006); (2) uncertainty in the model or observations (Pinto et al., 2017; Everett et al., 2021); (3) adversarial attacks against actions (Pattanaik et al., 2017; Fischer et al., 2019); and (4) sensitivity of neural networks (used as policy or value function approximators) to input disturbances (Kos & Song, 2017; Huang et al., 2017). Robustness does not naturally emerge in most RL settings, since agents are typically trained in a single, unchanging environment: there is a trade-off between how robust a policy is and how close it is to the set of optimal policies in its training environment, and in safety-critical applications we may need to provide formal guarantees for this trade-off.

Motivation

Consider a complex dynamical system for which we need to synthesise a controller (policy) through a model-free approach. When training in a simulator, we expect the deployment of the controller on the real system to be affected by various sources of noise, possibly neither predictable nor modelled (e.g., for networked components we may have sensor faults, communication delays, etc.). In safety-critical systems, robustness (in the sense of successfully controlling the system under disturbances) should preserve formal guarantees, and much effort has been devoted to developing formal convergence guarantees for policy gradient algorithms (Agarwal et al., 2021; Bhandari & Russo, 2019), guarantees which vanish when "robustifying" policies through regularisation or adversarial approaches.
Therefore, for such applications one needs a scheme that regulates the robustness-utility trade-off in RL policies while preserving the formal guarantees of the original algorithms and retaining sub-optimality conditions for the original problem.

Lexicographic Reinforcement Learning

Recently, lexicographic optimisation (Isermann, 1982; Rentmeesters et al., 1996) has been applied to the multi-objective RL setting (Skalse et al., 2022b). In a lexicographic RL (LRL) setting with reward-maximising objective functions {K_i}_{1≤i≤n}, some objectives are more important than others, so we may want policies that solve the multi-objective problem in a lexicographically prioritised way, i.e., "find the policies that optimise objective i (reasonably well), and among those the ones that optimise objective i+1 (reasonably well), and so on". Both value- and policy-based algorithms exist for LRL, and the approach is broadly applicable to most state-of-the-art RL algorithms (Skalse et al., 2022b).
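To make the lexicographic prioritisation concrete, the following is a minimal two-objective sketch of lexicographic gradient ascent on a parameter vector. The function names, the tolerance mechanism, and the gradient-projection step are our illustrative assumptions, not the exact scheme of Skalse et al. (2022b): the secondary objective is followed only while the primary objective stays within a tolerance of the best value seen, and its gradient is projected so as not to decrease the primary objective.

```python
import numpy as np

def lexicographic_ascent(grad_K1, grad_K2, K1, theta0, tol=0.05,
                         lr=0.1, steps=2000):
    """Two-objective lexicographic gradient ascent (illustrative sketch).

    Climbs the primary objective K1; whenever K1(theta) is within `tol`
    of the best value seen so far, it also follows the secondary
    objective K2, using only the component of grad K2 that does not
    decrease K1 (projection onto the half-space grad_K1 . d >= 0).
    """
    theta = np.asarray(theta0, dtype=float)
    best_K1 = -np.inf
    for _ in range(steps):
        g1 = grad_K1(theta)
        best_K1 = max(best_K1, K1(theta))
        if K1(theta) >= best_K1 - tol:
            # Primary objective satisfied up to tolerance: follow the
            # secondary gradient, projected so it cannot harm K1.
            g2 = grad_K2(theta)
            n1 = g1 / (np.linalg.norm(g1) + 1e-12)
            g2_proj = g2 - min(0.0, g2 @ n1) * n1
            d = g1 + g2_proj
        else:
            d = g1
        theta = theta + lr * d
    return theta
```

For instance, with K1(θ) = -(θ_0 - 1)² (which fixes θ_0 = 1 but leaves θ_1 free) and K2(θ) = -(θ_1 - 2)², the iterate converges to θ ≈ (1, 2): the secondary objective is optimised only over the slack left by the primary one.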

Previous Work

Regarding robustness against model uncertainty, the MDP may have noisy or uncertain reward signals or transition probabilities, as well as resulting distributional shifts in the training data (Heger, 1994; Xu & Mannor, 2006; Fu et al., 2018; Pattanaik et al., 2018; Pirotta et al., 2013; Abdullah et al., 2019), which connects to ideas from distributionally robust optimisation (Wiesemann et al., 2014; Van Parys et al., 2015). One of the first examples is Heger (1994), who proposes minimax approaches to learn Q functions that minimise the worst-case total discounted cost in a general MDP setting. Derman et al. (2020) propose a Bayesian approach to deal with uncertainty in the transitions. Another robustness sub-problem concerns adversarial attacks on policies or action selection in RL agents (Gleave et al., 2020; Lin et al., 2017; Tessler et al., 2019; Pan et al., 2019; Tan et al., 2020; Klima et al., 2019). Recently, Gleave et al. (2020) proposed that, instead of modifying observations, one could attack RL agents by swapping the policy for an adversarial one at given times. For a detailed review of robust RL see Moos et al. (2022). Our work focuses on robustness against observational disturbances, where agents observe a disturbed state measurement and use it as input to the policy (Kos & Song, 2017; Huang et al., 2017; Behzadan & Munir, 2017; Mandlekar et al., 2017; Zhang et al., 2020; 2021). In particular, Mandlekar et al. (2017) consider both random and adversarial state perturbations, and introduce physically plausible disturbance generation during the training of RL agents, making the resulting policies robust to realistic disturbances. Zhang et al. (2020) propose a state-adversarial MDP framework, and use adversarial regularising terms that can be added to different deep RL algorithms to make the resulting policies more robust to observational disturbances, minimising a distance bound between disturbed and undisturbed policies through convex relaxations of neural networks to obtain robustness guarantees. Zhang et al. (2021) study how LSTM architectures increase robustness against optimal state-perturbing adversaries.

1.1. MAIN CONTRIBUTIONS

Most existing work on RL with observational disturbances modifies RL algorithms (learning to deal with perturbations through linear combinations of regularising or adversarial loss terms) at the cost of explainability (in terms of sub-optimality bounds) and verifiability, since the induced changes in the resulting policies destroy convergence guarantees. Our main contributions are summarised in the following points.

Structure of Robust Policy Sets

We consider general unknown stochastic disturbances and formulate a quantitative definition of observational robustness that allows us to characterise the sets of robust policies for any MDP in the form of operator-invariant sets. We analyse how the structure of these sets depends on the MDP and the noise kernel, and obtain an inclusion relation (the Inclusion Theorem, Section 3) that provides intuition into how to search for robust policies more effectively.

Verifiable Robustness through LRL

While LRL was developed for reward-maximising objectives, the proposed definition of observational robustness lets us cast robustness as a lexicographic objective, allowing us to retain policy optimality up to a specified tolerance while maximising robustness, and yielding a mechanism to formally control the performance-robustness trade-off. This preserves the convergence guarantees of the original algorithm and yields formal bounds on policy sub-optimality. We provide numerical examples of how this logic is applied to existing policy gradient algorithms.
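As an illustration of how observational robustness can be turned into an optimisable objective, the sketch below scores a tabular softmax policy by how little its action distribution changes when the observed state is perturbed by a stochastic kernel. This is our minimal illustrative sketch, not the paper's exact definition: the function name `robustness_objective`, the softmax parametrisation, and the choice of total-variation distance are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def robustness_objective(logits, noise_kernel, n_samples=64):
    """Monte-Carlo estimate of a robustness score for a softmax policy.

    logits: (n_states, n_actions) policy parameters.
    noise_kernel: function mapping a state index to a sampled perturbed
        state index (a stand-in for the stochastic observation kernel).
    Returns minus the average total-variation distance between
    pi(.|s) and pi(.|perturbed s); higher (closer to 0) is more robust.
    """
    n_states = logits.shape[0]
    pi = softmax(logits)
    tv = 0.0
    for s in range(n_states):
        for _ in range(n_samples):
            s_pert = noise_kernel(s)
            tv += 0.5 * np.abs(pi[s] - pi[s_pert]).sum()
    return -tv / (n_states * n_samples)
```

A policy that plays the same distribution in every state attains the maximal score of 0 regardless of the noise, matching the intuition that constant policies are trivially robust; a reward-optimal but state-sensitive policy scores strictly below 0, which is exactly the tension the lexicographic scheme is meant to control.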

We claim novelty in the application of these concepts to the understanding and improvement of robustness in RL with disturbed observations. Although we have not found our results in previous work, there are strong connections between Sections 2-3 of this paper and the literature on planning for POMDPs (Spaan & Vlassis, 2004; Spaan, 2012) and MDP invariances (Ng et al., 1999; van der Pol et al., 2020; Skalse et al., 2022a).

