DENSITY CONSTRAINED REINFORCEMENT LEARNING

Abstract

Constrained reinforcement learning (CRL) plays an important role in solving safety-critical and resource-limited tasks. However, existing methods typically rely on tuning reward or cost parameters to encode the constraints, which can be tedious and tends not to generalize well. Instead of building sophisticated cost functions for constraints, we present a pioneering study of imposing constraints directly on the state density function of the system. Density functions have clear physical meanings and can express a variety of constraints in a straightforward fashion. We prove the duality between the density function and the Q function in CRL and use it to develop an effective primal-dual algorithm for solving density constrained reinforcement learning problems. We provide theoretical guarantees of the optimality of our approach and use a comprehensive set of case studies, including standard benchmarks, to show that our method outperforms other leading CRL methods in terms of achieving higher reward while respecting the constraints.

1. INTRODUCTION

Constrained reinforcement learning (CRL) (Achiam et al., 2017; Altman, 1999; Dalal et al., 2018; Paternain et al., 2019; Tessler et al., 2019) has received increasing interest as a way of addressing the safety challenges in reinforcement learning (RL). CRL techniques aim to find the optimal policy that maximizes the cumulative reward signal while respecting the specified constraints. Existing CRL approaches typically involve constructing suitable cost functions and value functions to take the constraints into account. A crucial step is then to choose appropriate parameters, such as thresholds on the cost and value functions, to encode the constraints. However, one significant gap between the use of such methods and solving practical RL problems is the correct construction of the cost and value functions, which is typically not done systematically but relies on engineering intuition (Paternain et al., 2019). Simple cost functions may not exhibit satisfactory performance, while sophisticated cost functions may not have clear physical meanings. When cost functions lack clear physical interpretations, it is difficult to formally guarantee the satisfaction of the performance specifications, even if the constraints on the cost functions are fulfilled. Moreover, different environments generally need different cost functions, which makes the tedious tuning process extremely time-consuming.

In this work, we fill the gap by imposing constraints on state density functions as an intuitive and systematic way to encode constraints in RL. Density is a measurement of state concentration in the state space, and is directly related to the state distribution. It has been well-studied in physics (Yang, 1991) and control (Brockett, 2012; Chen & Ames, 2019; Rantzer, 2001). A variety of real-world constraints are naturally expressed as density constraints in the state space.
Pure safety constraints can be trivially encoded by requiring the entire density of the states to be contained in the safe region. In more general examples, the vehicle densities in certain areas are supposed to be less than the critical density (Gerwinski & Krug, 1999) to avoid congestion. When spraying pesticide using drones, different parts of a farmland require different levels of pesticide density. Indeed, in the experiments we will see how these problems are solved with guarantees using density constrained RL (DCRL). Our approach is based on new theoretical results on the duality relationship between the density function and the value function in optimal control (Chen & Ames, 2019). One can prove generic duality between density functions and value functions for both continuous dynamics and discrete-state Markov decision processes (MDPs), under various setups such as Bolza-form terminal constraints, infinite-horizon discounted rewards, or finite-horizon cumulative rewards. In Chen & Ames (2019), the duality is proved for value functions in optimal control, assuming that the full dynamics of the world model is known. In this paper, we take a nontrivial step to establish the duality between the density function and the Q function (Theorem 1). We also reveal that under density constraints, the density function and the Q function are dual to each other (Theorem 2), which enables us to enforce constraints on state density functions in CRL. We propose a model-free primal-dual algorithm (Algorithm 1) to solve the DCRL problem, which is applicable in both discrete and continuous state and action spaces, and can be flexibly combined with off-the-shelf RL methods to update the policy. We prove the optimality of the policies returned by our algorithm if it converges (Proposition 1). We also discuss approaches to computing the key quantities required by Algorithm 1.
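The dual half of a primal-dual scheme of this flavor can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the primal (policy) step is stubbed out, and the density estimate `rho`, the per-bin upper bounds `c`, and `dual_ascent_step` are hypothetical names chosen for the example. The sketch shows the generic pattern of projected gradient ascent on Lagrange multipliers for constraints of the form density ≤ bound.

```python
import numpy as np

def dual_ascent_step(lmbda, rho, c, step=0.1):
    """Ascend the dual on the constraint violation (rho - c),
    projecting the multipliers back onto lmbda >= 0."""
    return np.maximum(0.0, lmbda + step * (rho - c))

# Toy quantities (assumed, for illustration only):
c = np.array([0.3, 0.8])    # density upper bounds on two state bins
rho = np.array([0.5, 0.6])  # current estimated state densities (held fixed here)
lmbda = np.zeros(2)         # Lagrange multipliers, one per constraint

for _ in range(50):
    # Primal step omitted: update the policy against the penalized
    # reward r(s, a) - lmbda[bin(s)], then re-estimate rho.
    lmbda = dual_ascent_step(lmbda, rho, c)
```

In a full algorithm the densities would respond to the policy update each iteration; here, with `rho` fixed, the multiplier of the violated constraint (bin 0, where 0.5 > 0.3) grows, while the satisfied constraint (bin 1) keeps a zero multiplier.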
Our main contributions are: 1) We are the first to introduce the DCRL problem with constraints on state density, which has a clear physical interpretation. 2) We are the first to prove and use the duality between density functions and Q functions over continuous state spaces to solve DCRL. 3) Our model-free primal-dual algorithm solves DCRL and guarantees the optimality of the reward and the satisfaction of density constraints simultaneously. 4) We use an extensive set of experiments to show the effectiveness and generalization capabilities of our algorithm over leading approaches such as CPO, RCPO, and PCPO, even when dealing with conflicting requirements.

Related work. Safe reinforcement learning (García & Fernández, 2015) primarily focuses on two approaches: modifying the optimality criterion by incorporating a risk factor (Heger, 1994; Nilim & El Ghaoui, 2005; Howard & Matheson, 1972; Borkar, 2002; Basu et al., 2008; Sato et al., 2001; Dotan Di Castro & Mannor, 2012; Kadota et al., 2006; Lötjens et al., 2019), and incorporating extra knowledge into the exploration process (Moldovan & Abbeel, 2012; Abbeel et al., 2010; Tang et al., 2010; Geramifard et al., 2013; Clouse & Utgoff, 1992; Thomaz et al., 2006; Chow et al., 2018). Our method falls into the first category by imposing constraints and is closely related to constrained Markov decision processes (CMDPs) (Altman, 1999) and CRL (Achiam et al., 2017; Lillicrap et al., 2016). CMDPs and CRL have been extensively studied in robotics (Gu et al., 2017; Pham et al., 2018), game theory (Altman & Shwartz, 2000), and communication and networks (Hou & Zhao, 2017; Bovopoulos & Lazar, 1992). Most previous works consider constraints on value functions, cost functions, and reward functions (Altman, 1999; Paternain et al., 2019; Altman & Shwartz, 2000; Dalal et al., 2018; Achiam et al., 2017; Ding et al., 2020). Instead, we directly impose constraints on the state density function.
Our approach builds on Chen et al. (2019) and Chen & Ames (2019), which assume known model dynamics. Instead, in this paper we consider the model-free setting and prove the duality of density functions to Q functions. In Geibel & Wysotzki (2005), density was studied as the probability of entering error states and thus has a fundamentally different physical interpretation from ours. In Dai et al. (2017), the duality was used to boost the actor-critic algorithm. The duality is also used in the policy evaluation community (Nachum et al., 2019; Nachum & Dai, 2020; Tang et al., 2019). The offline policy evaluation method proposed by Nachum et al. (2019) can also be used to estimate the state density in our paper, but their focus is policy evaluation rather than constrained RL. To the best of our knowledge, this paper is therefore the first work to consider density constraints and use the duality property to solve CRL.
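To make the notion of state density concrete, the following sketch estimates a discounted state-occupancy density by Monte-Carlo rollouts on a toy two-state chain. This is an illustration of the general quantity, not the paper's estimator (which is duality-based); the dynamics in `step_fn`, the fixed start state, and the helper names are all assumptions made for the example. The estimate is the normalized expectation of the discounted visitation counts, rho(s) ∝ E[Σ_t γ^t 1{s_t = s}].

```python
import random

def rollout(start, step_fn, horizon):
    """Collect the state sequence of one trajectory of given length."""
    s, traj = start, []
    for _ in range(horizon):
        traj.append(s)
        s = step_fn(s)
    return traj

def estimate_density(n_states, step_fn, gamma=0.9,
                     n_rollouts=2000, horizon=50, seed=0):
    """Monte-Carlo estimate of the (normalized) discounted occupancy."""
    random.seed(seed)
    counts = [0.0] * n_states
    for _ in range(n_rollouts):
        for t, s in enumerate(rollout(0, step_fn, horizon)):
            counts[s] += gamma ** t
    norm = sum(counts)
    return [c / norm for c in counts]

def step_fn(s):
    # Toy dynamics under some fixed policy: from state 0, move to 1
    # w.p. 0.3; from state 1, stay in 1 w.p. 0.9.
    r = random.random()
    if s == 0:
        return 1 if r < 0.3 else 0
    return 1 if r < 0.9 else 0

rho = estimate_density(2, step_fn)
```

A density constraint such as rho(s) ≤ c(s) can then be checked directly against this estimate; truncating at a finite horizon introduces a bias of order γ^horizon.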

2. PRELIMINARIES

Markov Decision Processes (MDP). An MDP $M$ is a tuple $\langle S, A, P, R, \gamma \rangle$, where (1) $S$ is the (possibly infinite) set of states; (2) $A$ is the (possibly infinite) set of actions; (3) $P : S \times A \times S \to [0, 1]$ is the transition probability, with $P(s, a, s')$ the probability of transitioning from state $s$ to $s'$ when action $a \in A$ is taken; (4) $R : S \times A \times S \to \mathbb{R}$ is the reward associated with the transition $P$ under the action $a \in A$; (5) $\gamma \in [0, 1]$ is a discount factor. A policy $\pi$ maps states to a probability distribution over actions, where $\pi(a|s)$ denotes the probability of choosing action $a$ at state $s$. Let a function $\phi : S \to \mathbb{R}$ specify the initial state distribution. The objective of MDP optimization is to find the optimal policy that maximizes the overall discounted reward $J_p = \int_S \phi(s) V^\pi(s)\, ds$, where $V^\pi(s)$ is called the value function and satisfies $V^\pi(s) = r^\pi(s) + \gamma \int_A \pi(a|s) \int_S P(s, a, s') V^\pi(s')\, ds'\, da$, and $r^\pi(s) = \int_A \pi(a|s) \int_S P(s, a, s') R(s, a, s')\, ds'\, da$ is the one-step reward from state $s$ following policy $\pi$. Every state $s$ occurring as an initial state with probability $\phi(s)$ incurs an expected cumulative discounted reward of $V^\pi(s)$; therefore the overall reward is $\int_S \phi(s) V^\pi(s)\, ds$. Although the equations are written in integral form, corresponding to continuous state-action spaces, the discrete counterparts can be derived similarly. Two major methods for solving MDPs are value iteration and policy iteration.
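The discrete counterpart of these definitions can be computed directly. The sketch below evaluates a fixed policy on a toy two-state, two-action MDP by iterating the Bellman expectation backup and then forms the objective $J_p = \sum_s \phi(s) V^\pi(s)$. All quantities (`P`, `R`, `pi`, `phi`) are made-up toy values for illustration; for simplicity the reward here depends only on $(s, a)$, a special case of $R(s, a, s')$.

```python
import numpy as np

# Toy MDP (all values illustrative).
P = np.zeros((2, 2, 2))                  # P[s, a, s']: transition probabilities
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]
R = np.array([[0.0, 1.0],                # R[s, a]: expected one-step reward
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],               # pi[s, a]: a fixed uniform policy
               [0.5, 0.5]])
phi = np.array([1.0, 0.0])               # initial state distribution
gamma = 0.9

# One-step reward under pi and the induced state-to-state transition matrix.
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)    # P_pi[s, s'] = sum_a pi[a|s) P(s,a,s')

# Iterate the Bellman backup V <- r_pi + gamma * P_pi V to the fixed point.
V = np.zeros(2)
for _ in range(1000):
    V = r_pi + gamma * P_pi @ V

J_p = phi @ V                            # overall discounted reward
```

Because the backup is a $\gamma$-contraction, the iterate converges to the unique fixed point, which equals the direct linear solve $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$.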

