SDAC: EFFICIENT SAFE REINFORCEMENT LEARNING WITH LOW-BIAS DISTRIBUTIONAL ACTOR-CRITIC

Abstract

To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle these guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may fail to meet the safety guidelines due to the estimation bias of distributional critics, and the importance sampling required by the trust region method can hinder performance due to its high variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions in which the bias-variance trade-off can be adjusted. Second, we propose novel surrogates for the trust region method, expressed with Q-functions using the reparameterization trick. Additionally, depending on the initial policy settings, there may be no policy that satisfies the constraints within a trust region. To handle this infeasibility issue, we propose a gradient integration method that is guaranteed to find a policy satisfying all constraints from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods. Furthermore, we demonstrate the benefit of safe RL for problems in which the reward cannot be easily specified.

1. INTRODUCTION

Deep reinforcement learning (RL) enables reliable control of complex robots (Merel et al., 2020; Peng et al., 2021; Rudin et al., 2022). Miki et al. (2022) have shown that RL can control quadrupedal robots more robustly than existing model-based optimal control methods, and Peng et al. (2022) have performed complex natural motion tasks using physically simulated characters. To successfully apply RL to real-world systems, it is essential to design a proper reward function that reflects safety guidelines, such as collision avoidance and limited energy consumption, as well as the goal of the given task. However, finding a reward function that accounts for all of these factors is cumbersome and time-consuming, since an RL algorithm must be rerun to verify the result of each candidate reward function. Instead, safe RL, which handles safety guidelines as constraints, can be an appropriate solution. A safe RL problem can be formulated using a constrained Markov decision process (Altman, 1999), in which not only the reward but also cost functions, which output the safety guideline signals, are defined. By defining constraints using risk measures of the sum of costs, such as conditional value at risk (CVaR), safe RL aims to maximize returns while satisfying the constraints. Under the safe RL framework, the training process becomes straightforward since there is no need to search for a reward that reflects the safety guidelines.

The most crucial part of safe RL is to satisfy the safety constraints, and this requires two conditions. First, constraints should be estimated with low biases. In general RL, the return is estimated using a function estimator called a critic, and, in safe RL, additional critics are used to estimate the constraint values. In our case, constraints are defined using risk measures, so it is essential to use distributional critics (Dabney et al., 2018b). The critics can then be trained using the distributional Bellman update (Bellemare et al., 2017). However, the Bellman update only considers the one-step temporal difference, which can induce a large bias. This estimation bias makes it difficult for the critics to evaluate the policy accurately, which can lead to the policy becoming overly conservative or overly risky, as shown in Section 5.3. Therefore, there is a need for a method that can train distributional critics with low biases.
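To make the two ingredients above concrete, the following minimal sketch shows how a CVaR constraint might be evaluated on sampled cost returns, and what a one-step distributional Bellman target looks like when the cost-return distribution is represented by equally weighted atoms. It is an illustration under stated assumptions, not the implementation used in this paper: the function names (`empirical_cvar`, `satisfies_constraint`, `one_step_target_quantiles`) and the threshold argument are hypothetical, and CVaR is taken here as the mean of the worst (largest) `1 - alpha` fraction of cost returns, which is only one of several conventions in the literature.

```python
import math


def empirical_cvar(cost_returns, alpha):
    """Mean of the worst (largest) (1 - alpha) fraction of cost returns.

    Assumed convention: costs are undesirable, so the risk lies in the
    upper tail; alpha = 0 recovers the plain mean.
    """
    if not cost_returns:
        raise ValueError("cost_returns must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(cost_returns)
    n = len(ordered)
    # Round before ceil so float noise (e.g. 3.0000000000000004)
    # does not pull an extra sample into the tail.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    tail = ordered[n - tail_size:]
    return sum(tail) / tail_size


def satisfies_constraint(cost_returns, alpha, threshold):
    """Risk-measure constraint: CVaR of the cost return must not exceed threshold."""
    return empirical_cvar(cost_returns, alpha) <= threshold


def one_step_target_quantiles(cost, gamma, next_quantiles, done):
    """One-step distributional Bellman target on equally weighted atoms.

    Each next-state atom z is mapped to cost + gamma * z; at a terminal
    transition the target collapses to the immediate cost.
    """
    if done:
        return [cost for _ in next_quantiles]
    return [cost + gamma * z for z in next_quantiles]


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    # With alpha = 0.75, the tail holds the two largest samples (7.0 and 8.0).
    print(empirical_cvar(samples, 0.75))
    print(satisfies_constraint(samples, 0.75, threshold=7.0))
    print(one_step_target_quantiles(1.0, 0.5, [2.0, 4.0], done=False))
```

Note that the one-step target is built entirely from the critic's own estimate at the next state, so any error in that estimate is carried directly into the target. This is the source of bias that the discussion above refers to.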

