ROBUST CONSTRAINED REINFORCEMENT LEARNING FOR CONTINUOUS CONTROL WITH MODEL MISSPECIFICATION

Abstract

Many real-world physical control systems are required to satisfy constraints upon deployment. Furthermore, real-world systems are often subject to effects such as non-stationarity, wear-and-tear, uncalibrated sensors and so on. Such effects effectively perturb the system dynamics and can cause a policy trained successfully in one domain to perform poorly when deployed to a perturbed version of the same domain. This can affect a policy's ability to maximize future rewards as well as the extent to which it satisfies constraints. We refer to this as constrained model misspecification. We present an algorithm that mitigates this form of misspecification, and showcase its performance in multiple simulated Mujoco tasks from the Real World Reinforcement Learning (RWRL) suite.

1. INTRODUCTION

Reinforcement Learning (RL) has had a number of recent successes in various application domains, including computer games (Silver et al., 2017; Mnih et al., 2015; Tessler et al., 2017) and robotics (Abdolmaleki et al., 2018a). As RL and deep learning continue to scale, an increasing number of real-world applications may become viable candidates for this technology. However, the application of RL to real-world systems is often associated with a number of challenges (Dulac-Arnold et al., 2019; Dulac-Arnold et al., 2020). We focus on the following two. Challenge 1 - Constraint satisfaction: many real-world systems have constraints that need to be satisfied upon deployment (i.e., hard constraints), or at least the number of constraint violations, as defined by the system, needs to be reduced as much as possible (i.e., soft constraints). This is prevalent in applications ranging from physical control systems, such as autonomous driving and robotics, to user-facing applications such as recommender systems. Challenge 2 - Model Misspecification (MM): many of these systems suffer from another challenge: model misspecification. We refer to the situation in which an agent is trained in one environment but deployed in a different, perturbed version of the environment as an instance of model misspecification. This may occur in many different applications and is well-motivated in the literature (Mankowitz et al., 2018; 2019; Derman et al., 2018; 2019; Iyengar, 2005; Tamar et al., 2014). There has been much work on constrained optimization in the literature (Altman, 1999; Tessler et al., 2018; Efroni et al., 2020; Achiam et al., 2017; Bohez et al., 2019). However, to our knowledge, the effect of model misspecification on an agent's ability to satisfy constraints at test time has not yet been investigated.

Constrained Model Misspecification (CMM):

We consider the scenario in which an agent is required to satisfy constraints at test time but is deployed in an environment that is different from its training environment (i.e., a perturbed version of the training environment). Deployment in a perturbed version of the environment may affect the return achieved by the agent as well as its ability to satisfy the constraints. We refer to this scenario as constrained model misspecification. This problem is prevalent in many real-world applications where constraints need to be satisfied but the environment is subject to perturbation effects such as wear-and-tear, partial observability, etc., the exact nature of which may be unknown at training time. Since such perturbations can significantly impact the agent's ability to satisfy the required constraints, it is insufficient to simply ensure that constraints are satisfied in the unperturbed version of the environment. Instead, the presence of unknown environment variations needs to be factored into the training process. One area where such considerations are of particular practical relevance is sim2real transfer, where the unknown sim2real gap can make it hard to ensure that constraints will be satisfied on the real system (Andrychowicz et al., 2018; Peng et al., 2018; Wulfmeier et al., 2017; Rastogi et al., 2018; Christiano et al., 2016). Of course, one could address this issue by limiting the capabilities of the system being controlled in order to ensure that constraints are never violated, for instance by limiting the amount of current in an electric motor. Our hope is that our methods can outperform these blunter techniques, while still ensuring constraint satisfaction in the deployment domain. Main Contributions: In this paper, we aim to bridge the two worlds of model misspecification and constraint satisfaction. We present an RL objective that enables us to optimize a policy that aims to be robust to CMM.
Our contributions are as follows: (1) We introduce the Robust Return Robust Constraint (R3C) and Robust Constraint (RC) RL objectives that aim to mitigate CMM as defined above; this includes the definition of a Robust Constrained Markov Decision Process (RC-MDP). (2) We derive the corresponding R3C and RC value functions and Bellman operators, and provide an argument showing that these Bellman operators converge to fixed points; these are implemented in the policy evaluation step of actor-critic R3C algorithms. (3) We implement five different R3C and RC algorithmic variants on top of D4PG and DMPO, two state-of-the-art continuous control RL algorithms. (4) We empirically demonstrate the superior performance of our algorithms, compared to various baselines, with respect to mitigating CMM; this is shown consistently across 6 different Mujoco tasks from the Real-World RL (RWRL) suite¹.

2.1. MARKOV DECISION PROCESSES

A Robust Markov Decision Process (R-MDP) is defined as a tuple $\langle S, A, R, \gamma, P \rangle$ where $S$ is a finite set of states, $A$ is a finite set of actions, $R : S \times A \to \mathbb{R}$ is a bounded reward function and $\gamma \in [0, 1)$ is the discount factor; $P(s, a) \subseteq \mathcal{M}(S)$ is an uncertainty set, where $\mathcal{M}(S)$ is the set of probability measures over next states $s' \in S$. This is interpreted as an agent selecting a state and action pair, with the next state $s'$ determined by a conditional measure $p(s' \mid s, a) \in P(s, a)$ (Iyengar, 2005). We want the agent to learn a policy $\pi : S \to A$, a mapping from states to actions, that is robust with respect to this uncertainty set. For the purpose of this paper, we consider deterministic policies, but this can easily be extended to stochastic policies too. The robust value function $V^{\pi} : S \to \mathbb{R}$ for a policy $\pi$ is defined as $V^{\pi}(s) = \inf_{p \in P(s, \pi(s))} V^{\pi, p}(s)$, where $V^{\pi, p}(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s)) V^{\pi, p}(s')$. A rectangularity assumption on the uncertainty set (Iyengar, 2005) ensures that "nature" can choose a worst-case transition function independently for every state $s$ and action $a$. This means that during a trajectory, at each timestep, nature can choose any transition model from the uncertainty set to reduce the performance of the agent. A robust policy optimizes the robust (worst-case) expected return objective: $J_{R}(\pi) = \inf_{p \in P} \mathbb{E}^{p, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$. The robust value function can be expanded as $V^{\pi}(s) = r(s, \pi(s)) + \gamma \inf_{p \in P(s, \pi(s))} \mathbb{E}^{p}\left[V^{\pi}(s') \mid s, \pi(s)\right]$. As in (Tamar et al., 2014), we can define an operator $\sigma^{\inf}_{P(s,a)} : \mathbb{R}^{|S|} \to \mathbb{R}$ as $\sigma^{\inf}_{P(s,a)} v = \inf \{ p^{\top} v \mid p \in P(s, a) \}$. We can also define an operator for a policy $\pi$, $\sigma^{\inf}_{\pi} : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$, where $\{\sigma^{\inf}_{\pi} v\}(s) = \sigma^{\inf}_{P(s, \pi(s))} v$. Then, the Robust Bellman operator for a policy $\pi$ can be written as $T^{\pi}_{P} v = r^{\pi} + \gamma \sigma^{\inf}_{\pi} v$.

¹ https://github.com/google-research/realworldrl_suite
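As a concrete illustration of this robust backup: when the uncertainty set $P(s, \pi(s))$ is a finite set of candidate transition distributions, the $\sigma^{\inf}$ operator reduces to a minimum over the candidates, and iterating the Robust Bellman operator converges to the robust value function (it is a $\gamma$-contraction). The two-state MDP, rewards, and candidate models below are a hypothetical example chosen for illustration, not taken from the paper; a minimal sketch in NumPy:

```python
import numpy as np

# Hypothetical 2-state MDP with the policy's action already applied:
# r[s] is the reward at state s; P[s] is a finite uncertainty set of
# next-state distributions (rectangularity: nature picks a model
# independently per state at every timestep).
r = np.array([1.0, 0.0])
P = [
    [np.array([0.9, 0.1]), np.array([0.6, 0.4])],  # candidate models for state 0
    [np.array([0.2, 0.8]), np.array([0.5, 0.5])],  # candidate models for state 1
]
gamma = 0.9

def robust_backup(v):
    # (T^pi_P v)(s) = r(s) + gamma * inf_{p in P(s)} p^T v
    # With a finite uncertainty set, the inf is a min over candidates.
    return np.array([r[s] + gamma * min(p @ v for p in P[s])
                     for s in range(len(r))])

# Repeated application converges to the robust value function,
# the unique fixed point of the contraction T^pi_P.
v = np.zeros(2)
for _ in range(1000):
    v = robust_backup(v)
```

Note that the resulting fixed point lower-bounds the value function obtained under any single fixed model from the uncertainty set, which is exactly the worst-case guarantee the robust objective $J_R(\pi)$ provides.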

