OBSERVATIONAL ROBUSTNESS AND INVARIANCES IN REINFORCEMENT LEARNING VIA LEXICOGRAPHIC OBJECTIVES

Abstract

Policy robustness in Reinforcement Learning (RL) may not be desirable at any price; the alterations that robustness requirements induce on otherwise optimal policies should be explainable and quantifiable. Policy gradient algorithms that have strong convergence guarantees are usually modified to obtain robust policies in ways that do not preserve those guarantees, which defeats the purpose of formal robustness requirements. In this work we study a notion of robustness in partially observable MDPs where state observations are perturbed by a noise-induced stochastic kernel. We characterise the set of policies that are maximally robust by analysing how the policies are altered by this kernel. We then establish a connection between such robust policies and certain properties of the noise kernel, as well as with structural properties of the underlying MDPs, constructing sufficient conditions for policy robustness. We use these notions to propose a robustness-inducing scheme, applicable to any policy gradient algorithm, that formally trades off the reward achieved by a policy against its robustness level through lexicographic optimisation, which preserves the convergence properties of the original algorithm. We test the proposed approach on safety-critical RL environments, and show how it helps achieve high robustness in observational noise problems.

1. INTRODUCTION

Robustness in Reinforcement Learning (RL) (Morimoto & Doya, 2005) can be looked at from different perspectives: (1) distributional shifts in the training data with respect to the deployment stage (Satia & Lave Jr, 1973; Heger, 1994; Nilim & El Ghaoui, 2005; Xu & Mannor, 2006); (2) uncertainty in the model or observations (Pinto et al., 2017; Everett et al., 2021); (3) adversarial attacks against actions (Pattanaik et al., 2017; Fischer et al., 2019); and (4) sensitivity of neural networks (used as policy or value function approximators) to input disturbances (Kos & Song, 2017; Huang et al., 2017). Robustness does not naturally emerge in most RL settings, since agents are typically trained in a single, unchanging environment: there is a trade-off between how robust a policy is and how close it is to the set of optimal policies in its training environment, and in safety-critical applications we may need to provide formal guarantees for this trade-off.

Motivation. Consider a complex dynamical system for which we need to synthesise a controller (policy) through a model-free approach. When using a simulator for training, we expect the deployment of the controller in the real system to be affected by sources of noise that may be neither predictable nor modelled (e.g., for networked components we may have sensor faults, communication delays, etc.). In safety-critical systems, robustness (in terms of successfully controlling the system under disturbances) should preserve formal guarantees, and considerable effort has been put into developing formal convergence guarantees for policy gradient algorithms (Agarwal et al., 2021; Bhandari & Russo, 2019) — guarantees that vanish when "robustifying" policies through regularisation or adversarial approaches. Therefore, for such applications one would need a scheme to regulate the robustness-utility trade-off in RL policies that preserves the formal guarantees of the original algorithms and retains sub-optimality conditions for the original problem.

