ROBUST REINFORCEMENT LEARNING WITH DISTRIBUTIONAL RISK-AVERSE FORMULATION

Abstract

The purpose of robust reinforcement learning is to make policies more robust to changes in the dynamics or rewards of the system. This problem is particularly important when the dynamics and rewards of the environment are estimated from data. However, without constraints, this problem is intractable. In this paper, we approximate Robust Reinforcement Learning constrained with an f-divergence using an approximate Risk-Averse formulation. We show that the classical Reinforcement Learning objective can be robustified using a standard-deviation penalization of the objective. Two algorithms based on Distributional Reinforcement Learning, one for discrete and one for continuous action spaces, are proposed and tested on classical Gym environments to demonstrate the robustness of the algorithms.

1. INTRODUCTION

The classical Reinforcement Learning (RL) (Sutton & Barto, 2018) problem, modeled with Markov Decision Processes (MDPs), gives a practical framework for solving sequential decision problems under uncertainty of the environment. However, for real-world applications, the final chosen policy can be very sensitive to sampling errors, inaccuracy of the model parameters, and the definition of the reward. This problem motivates robust Reinforcement Learning, which aims to reduce such sensitivity by taking into account that the transition and/or reward function (P, r) may vary arbitrarily inside a given uncertainty set. The optimal solution can be seen as the solution that maximizes a worst-case problem over this uncertainty set, or as the result of a dynamic zero-sum game where the agent tries to find the best policy under the most adversarial environment (Abdullah et al., 2019). In general, this problem is NP-hard (Wiesemann et al., 2013) due to the complex max-min formulation, making it challenging to solve in a discrete state-action space and to scale to a continuous state-action space. Many algorithms exist for the tabular case, for Robust MDPs with Wasserstein constraints over dynamics and reward such as Yang (2017); Petrik & Russel (2019); Grand-Clément & Kroer (2020a;b), or for L∞-constrained S-rectangular Robust MDPs (Behzadian et al., 2021). Here we focus on a more general continuous state space S with a discrete or continuous action space A, and with constraints defined using an f-divergence. Robust RL (Morimoto & Doya, 2005) with continuous action space focuses on robustness in the dynamics of the system (changes of P) and has been studied in Abdullah et al. (2019); Singh et al. (2020); Urpí et al. (2021); Eysenbach & Levine (2021), among others. Eysenbach & Levine (2021) tackle the problem of both reward and transition robustness using Max Entropy RL, whereas robustness to action noise perturbation is presented in Tessler et al. (2019). Here, we tackle the problem of robustness through the dynamics of the system. Recently, the issue of Robust Q-Learning has also been addressed in Ertefaie et al. (2021). In this paper, we show that it is possible to tackle a Robust Distributional Reinforcement Learning problem with f-divergence constraints by solving a risk-averse RL problem, using a formulation based on mean-standard-deviation optimization.

The idea behind this relies on an argument from Robust Learning theory, stating that Robust Learning under an uncertainty set defined with an f-divergence is asymptotically close to Mean-Variance optimization (Gotoh et al., 2018).
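To make this connection concrete, the robust objective and its risk-averse approximation can be sketched as follows; the symbols (nominal kernel P_0, radius ε, penalty weight λ) are notational choices for illustration and are not taken from the paper's own development:

```latex
% Robust objective over an f-divergence ball of radius \epsilon
% around the nominal transition kernel P_0:
\max_{\pi} \;
\min_{P \,:\, D_f(P \,\|\, P_0) \le \epsilon}
  \mathbb{E}_{\pi, P}\Big[ \textstyle\sum_{t} \gamma^{t} r_t \Big].

% For small \epsilon, robust-optimization arguments suggest the
% mean-standard-deviation approximation, with R the return:
\max_{\pi} \;
  \mathbb{E}_{\pi, P_0}[R]
  \;-\; \lambda(\epsilon)\, \sqrt{\mathrm{Var}_{\pi, P_0}[R]},
\qquad R = \textstyle\sum_{t} \gamma^{t} r_t .
```

The second line is the standard-deviation penalization of the classical objective referred to in the abstract: the agent trades expected return against the variability of the return distribution.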
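Since Distributional RL represents the full return distribution, the standard-deviation penalty can be read directly off the learned distribution. Below is a minimal sketch of how a mean-std penalized bootstrap target could be formed from a quantile representation; the function name, signature, and the coefficient `beta` are illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

def risk_averse_target(quantiles, reward, gamma=0.99, beta=0.1):
    """Mean-std penalized scalar target from a quantile approximation
    of the next-state return distribution Z(s', a').

    quantiles : 1-D array of N quantile estimates of Z(s', a')
    beta      : risk-aversion coefficient (beta = 0 recovers the
                standard expected-value Bellman target).
    """
    mean = quantiles.mean()
    std = quantiles.std()
    # Penalize the expected return by its standard deviation
    # before bootstrapping through the Bellman backup.
    return reward + gamma * (mean - beta * std)
```

With `beta = 0` this reduces to the usual expected Bellman target; increasing `beta` makes the target, and hence the learned policy, more averse to high-variance transitions.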

