BALANCING CONSTRAINTS AND REWARDS WITH META-GRADIENT D4PG

Abstract

Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., when no simulator or reasonable offline evaluation procedure exists). This results in problems where the task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable yet not catastrophic, motivating the need for soft-constrained RL approaches. We present a soft-constrained RL approach that utilizes meta-gradients to find a good trade-off between maximizing expected return and minimizing constraint violations. We demonstrate the effectiveness of this approach by showing that it consistently outperforms the baselines across four different MuJoCo domains.

1. INTRODUCTION

Reinforcement Learning (RL) algorithms typically aim to maximize an expected return objective (Sutton & Barto, 2018). This approach has led to numerous successes in a variety of domains, including board games (Silver et al., 2017), computer games (Mnih et al., 2015; Tessler et al., 2017) and robotics (Abdolmaleki et al., 2018). However, formulating real-world problems with only an expected return objective is often sub-optimal when tackling many applied problems, ranging from recommendation systems to physical control systems such as robots, self-driving cars and even aerospace technologies. In many of these domains, a variety of challenges prevent RL from being utilized as the algorithmic solution framework. Recently, Dulac-Arnold et al. (2019) presented nine challenges that need to be solved to enable RL algorithms to be utilized in real-world products and systems. One of those challenges is handling constraints. All of the above domains may include one or more constraints related to cost, wear-and-tear, or safety, to name a few.

Hard and Soft Constraints:

There are two types of constraints encountered in constrained optimization problems, namely hard constraints and soft constraints (Boyd & Vandenberghe, 2004). Hard constraints are pairs of pre-specified functions and thresholds that require the functions, when evaluated on the solution, to respect the thresholds. As such, these constraints may limit the feasible solution set. Soft constraints are similar to hard constraints in the sense that they are also defined by pairs of pre-specified functions and thresholds; however, a soft constraint does not require the solution to satisfy the constraint. Instead, it penalizes the objective function (according to a specified rule) if the solution violates the constraint (Boyd & Vandenberghe, 2004; Thomas et al., 2017).

Motivating Soft-Constraints: In real-world products and systems, there are many examples of soft constraints; that is, constraints that can be violated, where the violating behaviour is undesirable but not catastrophic (Thomas et al., 2017; Dulac-Arnold et al., 2020b). One concrete example is energy minimization in physical control systems. Here, the system may wish to reduce the amount of energy used by setting a soft constraint. Violating the constraint is inefficient, but not catastrophic to the system completing the task. In fact, there may be desirable characteristics that can only be attained if there are some constraint violations (e.g., a smoother/faster control policy). Another common setting is one where it is unclear how to set a threshold. In many instances, a product manager may desire to increase the level of performance on a particular product metric A, while ensuring that another metric B on the same product does not drop by 'approximately X%'. The value 'X' is often inaccurate and may not be feasible in many cases. In both of these settings, violating the threshold is undesirable, yet does not have catastrophic consequences.

Lagrange Optimization: In the RL paradigm, a number of approaches have been developed to incorporate hard constraints into the overall problem formulation (Altman, 1999; Tessler et al., 2018; Efroni et al., 2020; Achiam et al., 2017; Bohez et al., 2019; Chow et al., 2018; Paternain et al., 2019; Zhang et al., 2020). One popular approach is to model the problem as a Constrained Markov Decision Process (CMDP) (Altman, 1999). In this case, one method is to solve the problem formulation $\max_\pi J^\pi_R \ \text{s.t.} \ J^\pi_C \le \beta$, where $\pi$ is a policy, $J^\pi_R$ is the expected return, $J^\pi_C$ is the expected cost and $\beta$ is a constraint violation threshold. This is often solved by performing alternating optimization on the unconstrained Lagrangian relaxation of the original problem (e.g., Tessler et al. (2018)), defined as $\min_{\lambda \ge 0} \max_\pi J^\pi_R + \lambda(\beta - J^\pi_C)$. The updates alternate between learning the policy $\pi$ and the Lagrange multiplier $\lambda$. In many previous constrained RL works (Achiam et al., 2017; Tessler et al., 2018; Ray et al., 2019; Satija et al., 2020), because the problem is formulated with hard constraints, there are some domains in each case where a feasible solution is not found. This could be due to approximation errors, noise, or the constraints themselves being infeasible. These real-world applications, together with empirical constrained RL research results, further motivate the need to develop a soft-constrained RL optimization approach. Ideally, in this setup, we would like an algorithm that satisfies the constraints while solving the task by maximizing the objective. If the constraints cannot be satisfied, then this algorithm should find a good trade-off (that is, minimizing constraint violations while solving the task by maximizing the objective).

In this paper, we extend the constrained RL Lagrange formulation to perform soft-constrained optimization by formulating the constrained RL objective as a nested optimization problem (Sinha et al., 2017) using meta-gradients. We propose MetaL, which utilizes meta-gradients (Xu et al., 2018; Zahavy et al., 2020) to improve the trade-off between reducing constraint violations and improving expected return. We focus on Distributed Distributional Deterministic Policy Gradients (D4PG) (Barth-Maron et al., 2018), a state-of-the-art continuous-control RL algorithm, as the underlying algorithmic framework. We show that MetaL can capture an improved trade-off between expected return and constraint violations compared to the baseline approaches. We also introduce a second approach, called MeSh, that utilizes meta-gradients by adding additional representation power to the reward-shaping function. Our main contributions are as follows: (1) We extend D4PG to handle constraints by adapting Reward Constrained Policy Optimization (RCPO) (Tessler et al., 2018) to it, yielding Reward Constrained D4PG (RC-D4PG); (2) We present a soft-constrained meta-gradient technique: Meta-Gradients for the Lagrange multiplier learning rate (MetaL)¹; (3) We derive the meta-gradient update for MetaL (Theorem 1); (4) We perform extensive experiments and investigative studies to showcase the properties of this algorithm. MetaL outperforms the baseline algorithms across domains, safety coefficients and thresholds from the Real World RL suite (Dulac-Arnold et al., 2020b).
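
To make the alternating Lagrangian updates above concrete, the following is a minimal, self-contained sketch on a toy one-dimensional problem. It is illustrative only and is not the implementation used in this paper; the toy return/cost functions, threshold, and learning rates are all assumptions chosen for clarity.

```python
# Toy sketch of the alternating Lagrangian updates described above
# (illustrative only; not this paper's RL implementation).
# Problem: maximize J_R(theta) = -(theta - 2)^2
#          subject to J_C(theta) = theta^2 <= beta.

theta, lam = 0.0, 0.0
beta = 1.0                      # constraint threshold
lr_theta, lr_lam = 0.05, 0.01   # lambda is typically updated on a slower timescale

for _ in range(5000):
    grad_return = -2.0 * (theta - 2.0)   # dJ_R / dtheta
    grad_cost = 2.0 * theta              # dJ_C / dtheta
    # Primal step: gradient ascent on the Lagrangian J_R + lam * (beta - J_C).
    theta += lr_theta * (grad_return - lam * grad_cost)
    # Dual step: raise lam while the constraint is violated; project onto lam >= 0.
    lam = max(0.0, lam + lr_lam * (theta ** 2 - beta))

# The KKT point is theta = 1 (constraint active) with lam = 1.
print(f"theta = {theta:.3f}, lambda = {lam:.3f}")
```

In RCPO-style algorithms such as RC-D4PG, the dual step plays the same role on the policy's expected penalty $J^\pi_C$, while the primal step becomes a policy update on a penalized (shaped) reward.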

2. BACKGROUND

A Constrained Markov Decision Process (CMDP) is an extension of an MDP (Sutton & Barto, 2018) and consists of the tuple $\langle S, A, P, R, C, \gamma \rangle$ where $S$ is the state space; $A$ is the action space; $P : S \times A \to \Delta(S)$ is a function mapping states and actions to a distribution over next states; $R : S \times A \to \mathbb{R}$ is a bounded reward function; $C : S \times A \to \mathbb{R}^K$ is a $K$-dimensional function representing immediate penalties (or costs) relating to $K$ constraints; and $\gamma$ is the discount factor. The solution to a CMDP is a policy $\pi : S \to \Delta(A)$, which is a mapping from states to a probability distribution over actions. This policy aims to maximize the expected return $J^\pi_R = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$ and satisfy the constraints $J^\pi_{C_i} = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t c_{i,t}\big] \le \beta_i,\ i = 1, \dots, K$. For the purposes of this paper, we consider a single constraint, that is, $K = 1$, but this can easily be extended to multiple constraints.

Meta-Gradients are an approach to optimizing hyperparameters such as the discount factor, learning rates, etc. by performing online cross-validation while simultaneously optimizing for the overall RL optimization objective, such as the expected return (Xu et al., 2018; Zahavy et al., 2020). The goal is to optimize both an inner loss and an outer loss. The update of the $\theta$ parameters on the inner loss takes the form $\theta' = \theta + f(\tau, \theta, \eta)$, where $\tau$ is a sampled trajectory and $\eta$ denotes the meta-parameters (Xu et al., 2018).
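
As a toy illustration of this inner/outer pattern (not the paper's Theorem 1 derivation), the sketch below meta-learns a single scalar step size $\eta$ by differentiating an outer loss through one inner update step; this mirrors, in simplified form, how MetaL adapts the Lagrange multiplier's learning rate. The quadratic losses and step sizes are hypothetical choices for illustration.

```python
# Toy sketch of the meta-gradient inner/outer updates (illustrative only).
# Inner step: theta' = theta - eta * g, with g = dL_inner/dtheta.
# Outer step (chain rule through the inner update):
#   dL_outer/deta = L_outer'(theta') * (dtheta'/deta) = L_outer'(theta') * (-g).

def d_inner(theta):   # gradient of a toy inner loss 0.5 * (theta - 3)**2
    return theta - 3.0

def d_outer(theta):   # gradient of a toy outer loss 0.5 * (theta - 5)**2
    return theta - 5.0

theta, eta = 0.0, 0.1
meta_lr = 0.01        # outer (meta) learning rate

for _ in range(200):
    g = d_inner(theta)
    theta_next = theta - eta * g              # inner update of the agent parameters
    meta_grad = d_outer(theta_next) * (-g)    # backprop through the inner step
    eta -= meta_lr * meta_grad                # outer update of the meta-parameter
    theta = theta_next

print(f"eta = {eta:.3f}, theta = {theta:.3f}")
```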



¹ This is also the first time meta-gradients have been applied to an algorithm with experience replay.

