SDAC: EFFICIENT SAFE REINFORCEMENT LEARNING WITH LOW-BIAS DISTRIBUTIONAL ACTOR-CRITIC

Abstract

To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle such guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may fail to meet the safety guidelines due to the estimation bias of distributional critics, and the importance sampling required for the trust region method can hinder performance due to its significant variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation bias using proposed target distributions in which bias and variance can be traded off. Second, we propose novel surrogates for the trust region method, expressed with Q-functions using the reparameterization trick. Additionally, depending on the initial policy setting, there may be no policy satisfying the constraints within a trust region. To handle this infeasibility issue, we propose a gradient integration method which is guaranteed to find a policy satisfying all constraints starting from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods. Furthermore, we demonstrate the benefit of safe RL for problems in which the reward cannot be easily specified.

1. INTRODUCTION

Deep reinforcement learning (RL) enables reliable control of complex robots (Merel et al., 2020; Peng et al., 2021; Rudin et al., 2022). Miki et al. (2022) have shown that RL can control quadrupedal robots more robustly than existing model-based optimal control methods, and Peng et al. (2022) have performed complex natural motion tasks using physically simulated characters. To successfully apply RL to real-world systems, it is essential to design a proper reward function that reflects safety guidelines, such as collision avoidance and limited energy consumption, as well as the goal of the given task. However, finding a reward function that accounts for all of these factors is cumbersome and time-consuming, since RL algorithms must be run repeatedly to verify each candidate reward function. Instead, safe RL, which handles safety guidelines as constraints, is an appropriate alternative. A safe RL problem can be formulated using a constrained Markov decision process (Altman, 1999), in which cost functions that output the safety guideline signals are defined alongside the reward. By defining constraints using risk measures, such as conditional value at risk (CVaR), of the sum of costs, safe RL aims to maximize returns while satisfying the constraints. Under the safe RL framework, the training process becomes straightforward since there is no need to search for a reward that reflects the safety guidelines. The most crucial part of safe RL is satisfying the safety constraints, which requires two conditions. First, constraints should be estimated with low bias. In standard RL, the return is estimated using a function estimator called a critic, and safe RL uses additional critics to estimate the constraint values. In our case, constraints are defined using risk measures, so it is essential to use distributional critics (Dabney et al., 2018b).
The critics can then be trained using the distributional Bellman update (Bellemare et al., 2017). However, the Bellman update only considers the one-step temporal difference, which can induce a large bias. This estimation bias makes it difficult for critics to judge the policy, which can lead to the policy becoming overly conservative or risky, as shown in Section 5.3. Therefore, a method is needed that can train distributional critics with low bias. Second, a policy update method that accounts for safety constraints, denoted a safe policy update rule, is required not only to maximize the reward sum but also to satisfy the constraints after each policy update. Existing safe policy update rules can be divided into trust region-based and Lagrangian methods. Trust region-based methods calculate the update direction by approximating the safe RL problem within a trust region and update the policy through a line search (Yang et al., 2020; Kim & Oh, 2022a). Lagrangian methods convert the safe RL problem into a dual problem and update the policy and Lagrange multipliers jointly (Yang et al., 2021). However, Lagrangian methods are difficult to equip with theoretical guarantees of constraint satisfaction during training, and the training process can be unstable due to the multipliers (Stooke et al., 2020). In contrast, trust region-based methods are guaranteed to improve returns while satisfying constraints under tabular settings (Achiam et al., 2017). Still, trust region-based methods have critical issues of their own. There can be an infeasible starting case, in which no policy within the trust region satisfies the constraints due to the initial policy setting. Proper handling of this case is required, but such handling methods are lacking when there are multiple constraints. Furthermore, trust region-based methods are known to be sample-inefficient, as observed in several RL benchmarks (Achiam, 2018; Raffin et al., 2021).
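To make the risk-measure constraint concrete, the following is a minimal sketch of how CVaR can be read off from the output of a quantile-based distributional critic. The function name and the equal-weight atom assumption are illustrative, not the paper's implementation; here the "worst" tail is the highest-cost tail of the cost-return distribution.

```python
import numpy as np

def cvar_from_quantiles(quantiles: np.ndarray, alpha: float) -> float:
    """Estimate CVaR_alpha (mean of the worst alpha-fraction of outcomes)
    from equally weighted quantile atoms, as produced by a quantile critic.

    For cost returns, 'worst' means the upper (highest-cost) tail."""
    atoms = np.sort(quantiles)           # ascending cost-return atoms
    n = len(atoms)
    k = max(1, int(np.ceil(alpha * n)))  # number of atoms in the alpha-tail
    return float(np.mean(atoms[-k:]))    # average over the highest-cost atoms

# Example: 10 quantile atoms of a cost-return distribution
atoms = np.linspace(0.0, 9.0, 10)
print(cvar_from_quantiles(atoms, alpha=0.3))  # mean of the 3 largest atoms -> 8.0
```

Note that CVaR with alpha = 1 recovers the plain expectation, so this single estimator interpolates between risk-neutral and risk-averse constraints.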
In this paper, we propose an efficient trust region-based safe RL algorithm with multiple constraints, called the safe distributional actor-critic (SDAC). First, to train critics to estimate constraints with low bias, we propose a TD(λ) target distribution combining multiple-step distributions, where bias and variance can be traded off by adjusting the trace-decay parameter λ. Then, under off-policy settings, we present a memory-efficient method to approximate the TD(λ) target distribution using quantile distributions (Dabney et al., 2018b), which parameterize a distribution as a sum of Dirac functions. Second, to handle the infeasible starting case in multiple-constraint settings, we propose a gradient integration method, which recovers policies by reflecting all constraints simultaneously. It is guaranteed to obtain a policy that satisfies the constraints within a finite time under mild technical assumptions. Also, since all constraints are reflected at once, it can restore the policy more stably than existing handling methods (Xu et al., 2021), which consider only one constraint at a time. Finally, to improve the efficiency of the trust region method to the level of Soft Actor-Critic (SAC) (Haarnoja et al., 2018), we propose novel SAC-style surrogates. We show that the surrogates are bounded within a trust region and empirically confirm the improved efficiency in Appendix B. In summary, the proposed algorithm trains distributional critics with low bias using the TD(λ) target distributions and updates the policy using safe policy update rules with the SAC-style surrogates. If the policy cannot satisfy the constraints within the trust region, the gradient integration method recovers it to a feasible policy set.
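The idea of a TD(λ)-style target distribution can be sketched as a geometric mixture of n-step target distributions, re-projected onto a fixed set of quantile atoms. The mixture weights (1−λ)λ^(n−1) and the midpoint-quantile projection below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def mix_nstep_targets(nstep_atoms: list, lam: float, num_quantiles: int) -> np.ndarray:
    """Mix n-step quantile target distributions with geometric weights
    (1 - lam) * lam^(n-1), then project the mixture back onto a fixed
    number of quantile atoms. nstep_atoms[n-1] holds the atoms of the
    n-step target distribution."""
    N = len(nstep_atoms)
    # Geometric mixture weights, renormalized over the truncated horizon N.
    w = np.array([(1 - lam) * lam ** n for n in range(N)])
    w /= w.sum()
    # Pool all atoms, each carrying its share of the mixture weight.
    atoms = np.concatenate(nstep_atoms)
    weights = np.concatenate(
        [np.full(len(a), w_n / len(a)) for a, w_n in zip(nstep_atoms, w)]
    )
    # Project onto num_quantiles atoms via weighted quantiles at midpoints.
    order = np.argsort(atoms)
    atoms, weights = atoms[order], weights[order]
    cdf = np.cumsum(weights)
    taus = (np.arange(num_quantiles) + 0.5) / num_quantiles  # quantile midpoints
    idx = np.searchsorted(cdf, taus)
    return atoms[np.clip(idx, 0, len(atoms) - 1)]
```

With a single one-step distribution of atoms [0, 1, 2, 3] and two output quantiles, the projection returns the atoms at the 0.25 and 0.75 midpoints. The appeal of this form is that λ → 0 recovers the low-variance, high-bias one-step target, while λ → 1 approaches the high-variance Monte Carlo return distribution.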
To evaluate the proposed method, we conduct extensive experiments on four tasks in the Safety Gym environment (Ray et al., 2019) and show that the proposed method with risk-averse constraints achieves high returns with minimal constraint violations during training compared to other safe RL baselines. We also experiment with locomotion tasks using robots with different dynamic and kinematic models to demonstrate an advantage of safe RL over traditional RL: no reward engineering is required. The proposed method successfully trains locomotion policies with the same straightforward reward and constraints for robots with different configurations.

2. BACKGROUND

Constrained Markov Decision Processes. We formulate the safe RL problem using constrained Markov decision processes (CMDPs) (Altman, 1999). A CMDP is defined as (S, A, P, R, C_{1,...,K}, ρ, γ), where S is a state space, A is an action space, P : S × A × S → [0, 1] is a transition model, R : S × A × S → R is a reward function, C_{k∈{1,...,K}} : S × A × S → R_{≥0} are cost functions, ρ : S → [0, 1] is an initial state distribution, and γ ∈ (0, 1) is a discount factor. The state action value, state value, and advantage functions are defined as follows:

$$
Q^{\pi}_{R}(s, a) := \mathbb{E}_{\pi, P}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s, a_0 = a\right], \quad
V^{\pi}_{R}(s) := \mathbb{E}_{\pi, P}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s\right], \quad
A^{\pi}_{R}(s, a) := Q^{\pi}_{R}(s, a) - V^{\pi}_{R}(s). \tag{1}
$$

By substituting the costs for the reward, the cost value functions V^π_{C_k}(s), Q^π_{C_k}(s, a), and A^π_{C_k}(s, a) are defined analogously. In the remainder of the paper, the cost parts will be omitted since they can be retrieved by replacing the reward with the costs. Given a policy π from a stochastic policy set Π, the discounted

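To make the definitions in Eq. (1) concrete, the following is a minimal Monte Carlo sketch: the discounted return is the random variable whose conditional expectations define Q^π_R and V^π_R. The rollout data below is a hypothetical placeholder for trajectories sampled from π and P.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Empirical discounted return sum_t gamma^t * r_t for one trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Monte Carlo estimate of V^pi_R(s): average the discounted return over
# rollouts started from s (illustrative data, not a real simulator).
rollouts = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
v_estimate = np.mean([discounted_return(r, 0.9) for r in rollouts])
print(v_estimate)  # (2.71 + 1.81) / 2 = 2.26
```

An advantage estimate then follows the last identity in Eq. (1): subtract such a state-value estimate from the corresponding state-action value estimate.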