ROBUST REINFORCEMENT LEARNING WITH DISTRIBUTIONAL RISK-AVERSE FORMULATION

Abstract

The purpose of robust reinforcement learning is to learn policies that remain effective under changes in the dynamics or rewards of the system. This problem is particularly important when the dynamics and rewards of the environment are estimated from data. However, without constraints on the uncertainty set, this problem is intractable. In this paper, we approximate Robust Reinforcement Learning constrained with an f-divergence using an approximate Risk-Averse formulation. We show that the classical Reinforcement Learning formulation can be robustified using a standard deviation penalization of the objective. Two algorithms based on Distributional Reinforcement Learning, one for discrete and one for continuous action spaces, are proposed and tested on classical Gym environments to demonstrate the robustness of the algorithms.

1. INTRODUCTION

The classical Reinforcement Learning (RL) (Sutton & Barto, 2018) problem, formalized with Markov Decision Processes (MDPs), gives a practical framework for solving sequential decision problems under uncertainty of the environment. However, for real-world applications, the final chosen policy can be very sensitive to sampling errors, inaccuracy of the model parameters, and the definition of the reward. This problem motivates robust Reinforcement Learning, which aims to reduce such sensitivity by taking into account that the transition and/or reward function (P, r) may vary arbitrarily inside a given uncertainty set. The optimal solution can be seen as the one that maximizes a worst-case problem over this uncertainty set, or as the result of a dynamic zero-sum game where the agent tries to find the best policy under the most adversarial environment (Abdullah et al., 2019). In general, this problem is NP-hard (Wiesemann et al., 2013) due to the complex max-min structure, making it challenging to solve in a discrete state-action space and to scale to a continuous state-action space. Many algorithms exist in the tabular case for Robust MDPs with Wasserstein constraints over dynamics and reward, such as Yang (2017); Petrik & Russel (2019); Grand-Clément & Kroer (2020a;b), or for L∞-constrained S-rectangular Robust MDPs (Behzadian et al., 2021). Here we focus on a more general continuous state space S, with a discrete or continuous action space A, and with constraints defined using f-divergences. Robust RL with continuous action spaces (Morimoto & Doya, 2005) focuses on robustness in the dynamics of the system (changes of P) and has been studied in Abdullah et al. (2019). Here, we tackle the problem of robustness to changes in the dynamics of the system. Recently, the issue of robust Q-Learning has also been addressed in Ertefaie et al. (2021).
In this paper, we show that it is possible to tackle a Robust Distributional Reinforcement Learning problem with f-divergence constraints by solving a risk-averse RL problem, using a formulation based on mean-standard deviation optimization. The idea behind this relies on an argument from robust learning theory, stating that robust learning over an uncertainty set defined with an f-divergence is asymptotically close to Mean-Variance (Gotoh et al., 2018) or Mean-Standard Deviation optimization (Duchi et al., 2016; Duchi & Namkoong, 2018). In this work, we build on the idea that generalization, regularization, and robustness are strongly linked in RL and MDPs, as shown in Husain et al. (2021; 2022). We show that it is possible to improve the robustness of RL algorithms with variance/standard deviation regularization. Moreover, the problem of uncertainty over the distribution of the environment is transformed into a problem of uncertainty over the distribution of the rewards, which makes it tractable. Note that our work is related to Smirnova et al. (2019), as they also penalize the expectation by the variance of returns. However, their approach differs from ours since they use a variance estimate under a Gaussian assumption on the distributions, while we use a standard deviation penalization without any distributional assumption. Moreover, they do not demonstrate robustness to changes in dynamics numerically, and the problem they tackle is different since they consider close policy distributions, while we consider close dynamics distributions. The contributions of this work are the following: we motivate the use of standard deviation penalization and derive two algorithms, for discrete and continuous action spaces, that are robust to changes in dynamics. These algorithms only require tuning one additional parameter, the Mean-Standard Deviation trade-off.
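The mean-standard deviation trade-off can be sketched numerically. The following is a minimal illustration, not the paper's implementation: given samples of the return distribution, the penalized objective is the empirical mean minus λ times the empirical standard deviation, with λ (here `lam`) the trade-off parameter mentioned above.

```python
import numpy as np

def mean_std_objective(returns, lam):
    """Risk-averse score of a policy: empirical mean of its return samples
    minus lam times their empirical standard deviation."""
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - lam * returns.std()

# A risk-averse agent prefers the tighter return distribution even when
# both have the same mean (both means are 1.0 here):
safe  = np.array([1.0, 1.1, 0.9, 1.0])    # low spread
risky = np.array([3.0, -1.0, 3.2, -1.2])  # high spread
print(mean_std_objective(safe, lam=1.0) > mean_std_objective(risky, lam=1.0))
# prints True
```

With lam = 0 this reduces to the classical mean objective; increasing lam interpolates toward increasingly risk-averse (and, by the robustness argument above, increasingly robust) behaviour.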
Moreover, we show that our formulation using Distributional Reinforcement Learning is robust to changes in transition dynamics in environments with both discrete and continuous action spaces, both in the Mujoco suite and in stochastic environments derived from Mujoco.
Related topics:
Regularised MDPs : Policy regularisation in RL (Geist et al., 2019) has been studied and has led to state-of-the-art algorithms such as PPO and SAC (Schulman et al., 2017b; Haarnoja et al., 2018; Vieillard et al., 2020). In these algorithms, an additional penalisation based on the current policy is added to the classical objective function. Our idea is different, as we penalize the mean objective function using the standard deviation of the return distribution. Being pessimistic about the distributional state-value function leads to more stable learning, reduces the variance, and tends to improve the robustness of systems, as demonstrated by Brekelmans et al. (2022).
Distributional RL : Second-order estimation is done using Distributional Reinforcement Learning (Bellemare et al., 2017; Zhang & Weng), using a quantile estimate of the distribution to approximate the action-value function (Dabney et al., 2017; 2018) with the QR-DQN and IQN algorithms. Distributional state-action value representations are also used to learn an accurate critic for policy-based algorithms, as in Kuznetsov et al. (2020); Ma et al. (2021); Nam et al. (2021).
Risk-Averse RL : Risk-averse RL aims at optimizing objectives other than the classical mean, e.g. CVaR or other risk measures. For example, Dabney et al. (2018); Ma et al. (2021) use distributional RL to optimize different risk measures. Our goal is to show the robustness obtained by using risk-averse solutions to our initial problem.
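To make the quantile representation concrete, the sketch below shows the quantile-regression Huber loss used by QR-DQN (Dabney et al., 2017) to fit N quantile atoms of the return distribution; the function name and the fixed kappa are our own illustrative choices, not taken from that paper.

```python
import numpy as np

def quantile_huber_loss(theta, targets, kappa=1.0):
    """Quantile-regression Huber loss between N predicted quantile atoms
    `theta` and M Bellman-target samples `targets` (e.g. r + gamma * theta').
    """
    theta = np.asarray(theta, dtype=float)
    targets = np.asarray(targets, dtype=float)
    N = len(theta)
    # Midpoint quantile levels tau_i = (2i + 1) / (2N).
    tau = (np.arange(N) + 0.5) / N
    # Pairwise TD errors u_ij = target_j - theta_i.
    u = targets[None, :] - theta[:, None]
    # Huber penalty: quadratic near zero, linear beyond kappa.
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric quantile weighting |tau - 1{u < 0}|.
    weight = np.abs(tau[:, None] - (u < 0).astype(float))
    return (weight * huber / kappa).mean()
```

Minimizing this loss drives the atoms toward the quantiles of the target distribution; the fitted atoms then give the mean and standard deviation needed by our penalized objective for free.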
Our formulation is close to the mean-variance formulations that already exist in risk-averse RL (Jain et al., 2021b; Wang & Zhou, 2020), although these do not use a distributional framework, which shows highly competitive performance in a controlled setting.
Pessimism and Optimism in Distributional RL : Moskovitz et al. (2021) describe a way of performing optimistic/pessimistic Deep RL using a confidence interval constructed with the variance of rewards. Their work is close to ours in the pessimistic case, but their confidence interval is expressed in terms of the variance of the expectation estimate, not the variance of the distribution itself. Moreover, they use an adaptive regularizer, whereas we study the interest of a fixed parameter.
Preliminaries: Consider a Markov Decision Process (MDP) (S, A, P, γ), where S is the state space, A is the action space, P (r, s′ | s, a) is the joint reward and transition distribution from state s to s′ taking action a, and γ ∈ (0, 1) is the discount factor. Stochastic policies are denoted π(a | s) : S → ∆(A), and we consider action spaces that are either discrete or continuous. A rollout or trajectory using π from state s with initial action a is defined as the random sequence τ P,π (s, a) = ((s 0 , a 0 , r 0 ) , (s 1 , a 1 , r 1 ) , . . .) with s 0 = s, a 0 = a, a t ∼ π (· | s t ) and (r t , s t+1 ) ∼ P (·, · | s t , a t ); we denote the distribution over rollouts by P(τ ), with P(τ ) = ∏ t≥0 P (r t , s t+1 | s t , a t ) π (a t+1 | s t+1 ).
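The rollout notation can be grounded with a short Monte-Carlo sketch (our own illustration, with hypothetical `env_step` and `policy` callables standing in for P and π): repeatedly sampling τ P,π (s, a) yields an empirical distribution of discounted returns, whose mean and standard deviation are exactly the quantities our objective trades off.

```python
def sample_return(env_step, policy, s0, a0, gamma=0.99, horizon=200):
    """Draw one rollout tau_{P,pi}(s0, a0) and return its discounted return,
    truncated at `horizon` steps.

    env_step(s, a) -> (r, s_next)  samples (r_t, s_{t+1}) ~ P(., . | s, a)
    policy(s)      -> a            samples a_t ~ pi(. | s)
    """
    s, a = s0, a0
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        r, s = env_step(s, a)       # (r_t, s_{t+1}) ~ P
        g += discount * r           # accumulate gamma^t * r_t
        discount *= gamma
        a = policy(s)               # a_{t+1} ~ pi
    return g

# Sanity check on a degenerate MDP with constant reward 1 and gamma = 0.5:
# the return approaches sum_t 0.5^t = 2.
g = sample_return(lambda s, a: (1.0, s), lambda s: 0, 0, 0, gamma=0.5)
```

Calling `sample_return` many times gives samples of the return distribution; distributional RL replaces this naive Monte-Carlo estimate with the learned quantile representation discussed above.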



Further related work includes Singh et al. (2020); Urpí et al. (2021); Eysenbach & Levine (2021), among others. Eysenbach & Levine (2021) tackle the problem of robustness to both reward and transition using Max-Entropy RL, whereas the problem of robustness to action-noise perturbations is presented in Tessler et al. (2019); Derman & Mannor (2020); Derman et al. (2021); Ying et al. (2021); Brekelmans et al. (2022). Recent advances in Robust MDPs have shown a link between this field and Regularised MDPs, as in Derman et al. (2021); Kumar et al. (2022).

