SDAC: EFFICIENT SAFE REINFORCEMENT LEARNING WITH LOW-BIAS DISTRIBUTIONAL ACTOR-CRITIC

Abstract

To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle such guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may not meet the safety guidelines due to the estimation bias of distributional critics, and the importance sampling required for the trust region method can hinder performance due to its significant variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions in which bias and variance can be traded off. Second, we propose novel surrogates for the trust region method, expressed with Q-functions using the reparameterization trick. Additionally, depending on initial policy settings, there may be no policy satisfying the constraints within a trust region. To handle this infeasibility issue, we propose a gradient integration method which is guaranteed to find a policy satisfying all constraints from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods. Furthermore, we demonstrate the benefit of safe RL for problems in which the reward cannot be easily specified.

1. INTRODUCTION

Deep reinforcement learning (RL) enables reliable control of complex robots (Merel et al., 2020; Peng et al., 2021; Rudin et al., 2022). Miki et al. (2022) have shown that RL can control quadrupedal robots more robustly than existing model-based optimal control methods, and Peng et al. (2022) have performed complex natural motion tasks using physically simulated characters. In order to successfully apply RL to real-world systems, it is essential to design a proper reward function which reflects safety guidelines, such as collision avoidance and limited energy consumption, as well as the goal of the given task. However, finding a reward function that considers all such factors is a cumbersome and time-consuming task, since RL algorithms must be run repeatedly to verify the results of the designed reward function. Instead, safe RL, which handles safety guidelines as constraints, can be an appropriate solution. A safe RL problem can be formulated using a constrained Markov decision process (Altman, 1999), where not only the reward but also cost functions, which output the safety guideline signals, are defined. By defining constraints using risk measures, such as the conditional value at risk (CVaR), of the sum of costs, safe RL aims to maximize returns while satisfying the constraints. Under the safe RL framework, the training process becomes straightforward since there is no need to search for a reward that reflects the safety guidelines. The most crucial part of safe RL is to satisfy the safety constraints, and this requires two conditions. First, constraints should be estimated with low biases. In general RL, the return is estimated using a function estimator called a critic, and, in safe RL, additional critics are used to estimate the constraint values. In our case, constraints are defined using risk measures, so it is essential to use distributional critics (Dabney et al., 2018b).
Then, the critics can be trained using the distributional Bellman update (Bellemare et al., 2017). However, the Bellman update only considers the one-step temporal difference, which can induce a large bias. The estimation bias makes it difficult for the critics to judge the policy, which can lead to the policy becoming overly conservative or risky, as shown in Section 5.3. Therefore, a method is needed that can train distributional critics with low biases. Second, a policy update method considering safety constraints, denoted as a safe policy update rule, is required not only to maximize the reward sum but also to satisfy the constraints after updating the policy. Existing safe policy update rules can be divided into trust region-based and Lagrangian methods. Trust region-based methods calculate the update direction by approximating the safe RL problem within a trust region and update the policy through a line search (Yang et al., 2020; Kim & Oh, 2022a). Lagrangian methods convert the safe RL problem into a dual problem and update the policy and Lagrange multipliers (Yang et al., 2021). However, it is difficult for Lagrangian methods to theoretically guarantee constraint satisfaction during training, and the training process can be unstable due to the multipliers (Stooke et al., 2020). In contrast, trust region-based methods can guarantee improved returns while satisfying constraints under tabular settings (Achiam et al., 2017). Still, trust region-based methods have critical issues. There can be an infeasible starting case, meaning that no policy satisfies the constraints within the trust region due to initial policy settings. Proper handling of this case is thus required, but such handling methods are lacking when there are multiple constraints. Furthermore, trust region-based methods are known to be not sample-efficient, as observed in several RL benchmarks (Achiam, 2018; Raffin et al., 2021).
In this paper, we propose an efficient trust region-based safe RL algorithm with multiple constraints, called the safe distributional actor-critic (SDAC). First, to train critics to estimate constraints with low biases, we propose a TD(λ) target distribution combining multiple-step distributions, in which bias and variance can be traded off by adjusting the trace-decay λ. Then, under off-policy settings, we present a memory-efficient method to approximate the TD(λ) target distribution using quantile distributions (Dabney et al., 2018b), which parameterize a distribution as a sum of Dirac functions. Second, to handle the infeasible starting case for multiple constraint settings, we propose a gradient integration method, which recovers policies by reflecting all constraints simultaneously. It is guaranteed to obtain a policy which satisfies the constraints within a finite time under mild technical assumptions. Also, since all constraints are reflected at once, it can restore the policy more stably than existing handling methods (Xu et al., 2021), which consider only one constraint at a time. Finally, to make the trust region method as efficient as soft actor-critic (SAC) (Haarnoja et al., 2018), we propose novel SAC-style surrogates. We show that the surrogates have bounds within a trust region and empirically confirm improved efficiency in Appendix B. In summary, the proposed algorithm trains distributional critics with low biases using the TD(λ) target distributions and updates the policy using safe policy update rules with the SAC-style surrogates. If the policy cannot satisfy the constraints within the trust region, the gradient integration method recovers the policy to a feasible policy set.
To evaluate the proposed method, we conduct extensive experiments on four tasks in the Safety Gym environment (Ray et al., 2019) and show that the proposed method with risk-averse constraints achieves high returns with minimal constraint violations during training compared to other safe RL baselines. We also experiment with locomotion tasks using robots with different dynamic and kinematic models to demonstrate an advantage of safe RL over traditional RL: no reward engineering is required. The proposed method successfully trains locomotion policies with the same straightforward reward and constraints for robots with different configurations.

2. BACKGROUND

Constrained Markov Decision Processes. We formulate the safe RL problem using constrained Markov decision processes (CMDPs) (Altman, 1999). A CMDP is defined as $(S, A, P, R, C_{1,\dots,K}, \rho, \gamma)$, where $S$ is a state space, $A$ is an action space, $P: S \times A \times S \rightarrow [0, 1]$ is a transition model, $R: S \times A \times S \rightarrow \mathbb{R}$ is a reward function, $C_{k \in \{1,\dots,K\}}: S \times A \times S \rightarrow \mathbb{R}_{\geq 0}$ are cost functions, $\rho: S \rightarrow [0, 1]$ is an initial state distribution, and $\gamma \in (0, 1)$ is a discount factor. The state-action value, state value, and advantage functions are defined as follows:
$$Q_R^\pi(s, a) := \mathbb{E}_{\pi, P}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s, a_0 = a\Big], \quad V_R^\pi(s) := \mathbb{E}_{\pi, P}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s\Big],$$
$$A_R^\pi(s, a) := Q_R^\pi(s, a) - V_R^\pi(s). \quad (1)$$
By substituting the costs for the reward, the cost value functions $V_{C_k}^\pi(s)$, $Q_{C_k}^\pi(s, a)$, $A_{C_k}^\pi(s, a)$ are defined. In the remainder of the paper, the cost parts will be omitted since they can be retrieved by replacing the reward with the costs. Given a policy $\pi$ from a stochastic policy set $\Pi$, the discounted state distribution is defined as $d^\pi(s) := (1 - \gamma)\sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \,|\, \pi)$, and the return is defined as $Z_R^\pi(s, a) := \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})$, where $s_0 = s$, $a_0 = a$, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim P(\cdot|s_t, a_t)$. Then, the safe RL problem is defined as follows with a safety measure $F$:
$$\max_\pi \mathbb{E}\left[Z_R^\pi(s, a) \,|\, s \sim \rho, a \sim \pi(\cdot|s)\right] \;\;\text{s.t.}\;\; F\left(Z_{C_k}^\pi(s, a) \,|\, s \sim \rho, a \sim \pi(\cdot|s)\right) \leq d_k \;\;\forall k, \quad (2)$$
where $d_k$ is the limit value of the $k$th constraint.

Mean-Std Constraints. In our safe RL setting, we use the mean-std as the safety measure: $F(Z; \alpha) = \mathbb{E}[Z] + (\phi(\Phi^{-1}(\alpha))/\alpha) \cdot \mathrm{Std}[Z]$, where $\alpha \in (0, 1]$ adjusts the conservativeness of the constraints, $\mathrm{Std}[Z]$ is the standard deviation of $Z$, $\phi$ is the probability density function, and $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution.
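As an illustration, the mean-std measure $F(Z;\alpha)$ can be computed from return samples with Python's standard library. This is a minimal sketch under our own naming (`mean_std_measure` is not from the paper), and it uses sample statistics, whereas the paper estimates the mean and standard deviation with critics:

```python
from statistics import NormalDist, fmean, pstdev

def mean_std_measure(samples, alpha):
    """Mean-std safety measure F(Z; alpha) = E[Z] + (phi(Phi^-1(alpha)) / alpha) * Std[Z].

    Coincides with CVaR at level alpha when Z is Gaussian. For alpha = 1, the
    risk coefficient phi(Phi^-1(1)) / 1 = phi(inf) = 0, so F reduces to the mean.
    """
    if alpha >= 1.0:
        kappa = 0.0
    else:
        nd = NormalDist()  # standard normal
        kappa = nd.pdf(nd.inv_cdf(alpha)) / alpha  # risk coefficient
    return fmean(samples) + kappa * pstdev(samples)
```

Smaller $\alpha$ yields a larger coefficient and hence a more conservative (risk-averse) measure, matching the $\alpha$ sweep used in the experiments.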
The mean-std is identical to the conditional value at risk (CVaR) if $Z$ follows a Gaussian distribution, and the mean-std constraint can effectively reduce the number of constraint violations, as shown by Yang et al. (2021); Kim & Oh (2022b;a). To estimate the mean-std of cost returns, Kim & Oh (2022b) define the square value functions:
$$S_{C_k}^\pi(s) := \mathbb{E}_{\pi, P}\left[Z_{C_k}^\pi(s, a)^2 \,|\, a \sim \pi(\cdot|s)\right], \quad S_{C_k}^\pi(s, a) := \mathbb{E}_{\pi, P}\left[Z_{C_k}^\pi(s, a)^2\right], \quad A_{S_k}^\pi(s, a) := S_{C_k}^\pi(s, a) - S_{C_k}^\pi(s).$$
Additionally, $d_2^\pi(s) := (1 - \gamma^2)\sum_{t=0}^{\infty} \gamma^{2t} \Pr(s_t = s \,|\, \pi)$ denotes a doubly discounted state distribution. Then, the $k$th constraint can be written as follows:
$$F_k(\pi; \alpha) = J_{C_k}(\pi) + \frac{\phi(\Phi^{-1}(\alpha))}{\alpha}\sqrt{J_{S_k}(\pi) - J_{C_k}(\pi)^2} \leq d_k, \quad (3)$$
where $J_{C_k}(\pi) := \mathbb{E}_{s \sim \rho}\left[V_{C_k}^\pi(s)\right]$ and $J_{S_k}(\pi) := \mathbb{E}_{s \sim \rho}\left[S_{C_k}^\pi(s)\right]$.

Distributional Quantile Critic. To parameterize the distribution of the returns, Dabney et al. (2018b) have proposed an approximation method to estimate the returns using the following quantile distributions, called a distributional quantile critic: $\Pr(Z_{R,\theta}^\pi(s, a) = z) := \sum_{m=1}^{M} \delta_{\theta_m(s, a)}(z)/M$, where $M$ is the number of atoms, $\theta$ is a parametric model, and $\theta_m(s, a)$ is the $m$th atom. The percentile value of the $m$th atom is denoted by $\tau_m$ ($\tau_0 = 0$, $\tau_i = i/M$). In distributional RL, the returns are directly estimated to get the value functions, and the target distribution can be calculated from the distributional Bellman operator (Bellemare et al., 2017): $\mathcal{T}^\pi Z_R(s, a) :\overset{D}{=} R(s, a, s') + \gamma Z_R(s', a')$, where $s' \sim P(\cdot|s, a)$ and $a' \sim \pi(\cdot|s')$. The above one-step operator can be expanded to the $n$-step one: $\mathcal{T}_n^\pi Z_R(s_0, a_0) :\overset{D}{=} \sum_{t=0}^{n-1} \gamma^t R(s_t, a_t, s_{t+1}) + \gamma^n Z_R(s_n, a_n)$. Then, the critic can be trained to minimize the following quantile regression loss (Dabney et al., 2018b):
$$L(\theta) = \sum_{m=1}^{M} \mathbb{E}_{\hat{Z} \sim Z}\big[\rho_{\hat{\tau}_m}(\hat{Z} - \theta_m)\big] =: \sum_{m=1}^{M} L_{QR}^{\hat{\tau}_m}(\theta_m), \quad (4)$$
where $\rho_\tau(x) = x \cdot (\tau - \mathbf{1}_{x < 0})$, $\hat{\tau}_m := \frac{\tau_{m-1} + \tau_m}{2}$, and $L_{QR}^\tau(\theta)$ denotes the quantile regression loss for a single atom.
The distributional quantile critic can be plugged into existing actor-critic algorithms because only the critic modeling is changed.
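The quantile regression (pinball) loss above can be sketched in plain Python. This is an illustrative helper of our own naming (`quantile_regression_loss`); a practical critic would evaluate the same loss over network outputs with automatic differentiation:

```python
def quantile_regression_loss(atoms, targets):
    """Quantile regression loss between critic atoms and target samples.

    atoms:   list of M atom positions theta_m(s, a), one per quantile.
    targets: list of samples from the target return distribution.
    Uses rho_tau(u) = u * (tau - 1{u < 0}) at midpoint quantiles
    tau_hat_m = (tau_{m-1} + tau_m) / 2 = (m + 0.5) / M.
    """
    M = len(atoms)
    loss = 0.0
    for m, theta in enumerate(atoms):
        tau_hat = (m + 0.5) / M  # midpoint quantile tau_hat_m
        for z in targets:
            u = z - theta
            loss += u * (tau_hat - (1.0 if u < 0 else 0.0))
    return loss / len(targets)
```

Minimizing this loss drives each atom $\theta_m$ toward the $\hat{\tau}_m$-quantile of the target distribution; with a single atom ($M = 1$, $\hat{\tau}_1 = 0.5$), the minimizer is the median.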

3. PROPOSED METHOD

We propose the following three approaches to enhance the safety performance of trust region-based safe RL methods. First, we introduce a TD(λ) target distribution combining n-step distributions, which can trade off bias-variance. The target distribution enables training of the distributional critics with low biases. Second, we propose novel surrogate functions for policy updates that empirically improve the performance of the trust region method. Finally, we present a gradient integration method under multiple constraint settings to handle the infeasible starting cases.

3.1. TD(λ) TARGET DISTRIBUTION

In this section, we propose a target distribution by observing that the TD($\lambda$) loss, which is obtained by a weighted sum of several losses, and the quantile regression loss with a single distribution are equal. A recursive method is then introduced so that the target distribution can be obtained practically. First, after collecting trajectories $(s_t, a_t, s_{t+1}, \dots)$ with a behavioral policy $\mu$, the $n$-step targets are estimated as follows:
$$\hat{Z}_t^{(n)} :\overset{D}{=} R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{n-1} R_{t+n-1} + \gamma^n Z_{R,\theta}^\pi(s_{t+n}, a'_{t+n}), \quad (5)$$
where $R_t = R(s_t, a_t, s_{t+1})$, $a'_{t+n} \sim \pi(\cdot|s_{t+n})$, and $\pi$ is the current policy. Note that the $n$-step target controls the bias-variance tradeoff through $n$. If $n$ is equal to 1, the $n$-step target is equivalent to the temporal difference method, which has low variance but high bias. On the contrary, as $n$ goes to infinity, it becomes a Monte-Carlo estimation, which has high variance but low bias. However, finding a proper $n$ is another cumbersome task. To alleviate this issue, the TD($\lambda$) method (Sutton, 1988) considers the discounted sum of all $n$-step targets. Similar to TD($\lambda$), we define the TD($\lambda$) loss for the distributional quantile critic as the discounted sum of all quantile regression losses with $n$-step targets.

[Figure 1: Constructing procedure for the target distribution. First, multiply the target at step $t+1$ by $\gamma$ and add $R_t$. Next, weight-combine the shifted previous target and the one-step target at step $t$, and restore the CDF of the combined target. The CDF can be restored by sorting the positions of the atoms and then accumulating the weights at each atom position. Finally, the projected target can be obtained by finding the positions of the atoms corresponding to $M'$ quantiles in the CDF. Using the projected target, the target at step $t-1$ can be found recursively.]
Then, the TD($\lambda$) loss for a single atom is approximated using importance sampling of the sampled $n$-step targets in (5) as:
$$L_{QR}^\tau(\theta) = (1 - \lambda)\sum_{i=0}^{\infty} \lambda^i\, \mathbb{E}_{\hat{Z} \sim \mathcal{T}_{i+1}^\pi Z}\big[\rho_\tau(\hat{Z} - \theta)\big] \approx \frac{1 - \lambda}{M}\sum_{i=0}^{\infty} \lambda^i \prod_{j=1}^{i} \frac{\pi(a_{t+j}|s_{t+j})}{\mu(a_{t+j}|s_{t+j})} \sum_{m=1}^{M} \rho_\tau\big(\hat{Z}_{t,m}^{(i+1)} - \theta\big), \quad (6)$$
where $\lambda$ is a trace-decay value, and $\hat{Z}_{t,m}^{(i)}$ is the $m$th atom of $\hat{Z}_t^{(i)}$. Since $\hat{Z}_t^{(i)} \overset{D}{=} R_t + \gamma \hat{Z}_{t+1}^{(i-1)}$ is satisfied, (6) is the same as the quantile regression loss with the following single distribution $\hat{Z}_t^{\mathrm{tot}}$, called a TD($\lambda$) target distribution:
$$\Pr(\hat{Z}_t^{\mathrm{tot}} = z) := \frac{1}{N}\,\frac{1 - \lambda}{M} \sum_{i=0}^{\infty} \lambda^i \prod_{j=1}^{i} \frac{\pi(a_{t+j}|s_{t+j})}{\mu(a_{t+j}|s_{t+j})} \sum_{m=1}^{M} \delta_{\hat{Z}_{t,m}^{(i+1)}}(z) = \frac{1}{N}\Bigg(\underbrace{(1 - \lambda)\sum_{m=1}^{M} \frac{1}{M}\delta_{\hat{Z}_{t,m}^{(1)}}(z)}_{\text{one-step TD target}} + \underbrace{\lambda\,\frac{\pi(a_{t+1}|s_{t+1})}{\mu(a_{t+1}|s_{t+1})}\Pr(R_t + \gamma \hat{Z}_{t+1}^{\mathrm{tot}} = z)}_{\text{previous TD}(\lambda)\text{ target}}\Bigg), \quad (7)$$
where $N$ is a normalization factor. We show that a distribution trained with the proposed target converges to the distribution of $Z^\pi$ in Appendix A.1. If the target for time step $t+1$ is obtained, the target distribution for time step $t$ becomes the weighted sum of the current one-step TD target and the shifted previous target distribution, so it can be obtained recursively, as shown in (7). However, to obtain the target distribution, we need to store all quantile positions and weights for all time steps, which is not memory-efficient. Therefore, we propose to project the target distribution onto a quantile distribution with a specific number of atoms, $M'$ (we set $M' > M$ to reduce information loss). The overall process to get the TD($\lambda$) target distribution is illustrated in Figure 1, and the pseudocode is given in Appendix A.2. After calculating the target distribution for all time steps, the critic can be trained to reduce the quantile regression loss with the target distribution.
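One backward-recursion step of this construction, followed by the projection onto $M'$ quantiles, can be sketched as follows. This is a simplified, pure-Python illustration with our own helper names (`project_to_quantiles`, `td_lambda_target`): distributions are kept as lists of (atom, weight) pairs, and the projection inverts the empirical CDF at quantile midpoints, as in Figure 1.

```python
def project_to_quantiles(atoms_weights, m_prime):
    """Project a weighted mixture of Diracs onto a quantile distribution with
    m_prime equal-weight atoms by inverting the empirical CDF at midpoints."""
    pts = sorted(atoms_weights)  # sort atom positions to build the CDF
    total = sum(w for _, w in pts)
    out, cum, i = [], 0.0, 0
    for m in range(m_prime):
        q = (m + 0.5) / m_prime * total  # target CDF level
        while i < len(pts) - 1 and cum + pts[i][1] < q:
            cum += pts[i][1]
            i += 1
        out.append(pts[i][0])
    return out

def td_lambda_target(reward, gamma, lam, ratio, one_step_atoms,
                     next_target_atoms, m_prime):
    """One backward-recursion step of the TD(lambda) target distribution.

    one_step_atoms:    atoms of the one-step TD target at time t (weight 1/M each).
    next_target_atoms: projected TD(lambda) target atoms at time t+1.
    ratio:             importance ratio pi(a_{t+1}|s_{t+1}) / mu(a_{t+1}|s_{t+1}).
    """
    mix = [(z, (1.0 - lam) / len(one_step_atoms)) for z in one_step_atoms]
    # shifted previous target: R_t + gamma * Z_tot_{t+1}, weighted by lam * ratio
    w = lam * ratio / len(next_target_atoms)
    mix += [(reward + gamma * z, w) for z in next_target_atoms]
    return project_to_quantiles(mix, m_prime)
```

With $\lambda = 0$ the result collapses to the one-step target, and with $\lambda = 1$ (and ratio 1) it is the shifted previous target, matching the two terms of the mixture in the recursion.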

3.2. SAFE DISTRIBUTIONAL ACTOR-CRITIC

SAC-Style Surrogates. Here, we derive efficient surrogate functions for the trust region method. There are two main streams of trust region methods, trust region policy optimization (TRPO) (Schulman et al., 2015) and proximal policy optimization (PPO) (Schulman et al., 2017), but we only consider TRPO since PPO approximates TRPO by considering only the sum of rewards and hence cannot reflect safety constraints. There are several variants of TRPO (Nachum et al., 2018; Wang et al., 2017), among which off-policy TRPO (Meng et al., 2022) shows significantly improved sample efficiency by using off-policy data. Still, SAC outperforms off-policy TRPO (Meng et al., 2022), so we extend the surrogate of off-policy TRPO to resemble the policy loss of SAC. To this end, the surrogate should 1) have entropy regularization and 2) be expressed with Q-functions. If we define the objective function with entropy regularization as $J(\pi) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t (R(s_t, a_t, s_{t+1}) + \beta H(\pi(\cdot|s_t))) \,|\, \rho, \pi, P\right]$, where $H$ is the Shannon entropy, we can define the following surrogate function:
$$J^{\mu,\pi}(\pi') := \mathbb{E}_{s_0 \sim \rho}\left[V^\pi(s_0)\right] + \frac{1}{1 - \gamma}\,\mathbb{E}_{d^\mu, \mu}\left[\frac{\pi'(a|s)}{\mu(a|s)} A^\pi(s, a)\right] + \beta\,\mathbb{E}_{d^\pi}\left[H(\pi'(\cdot|s))\right], \quad (8)$$
where $\mu$, $\pi$, and $\pi'$ are the behavioral, current, and next policies, respectively. Then, we can derive a bound on the difference between the objective and surrogate functions.

Theorem 1. Let us assume that $\max_s H(\pi(\cdot|s)) < \infty$ for $\forall \pi \in \Pi$. The difference between the objective and surrogate functions is bounded by a term consisting of KL divergence as:
$$J(\pi') - J^{\mu,\pi}(\pi') \leq \frac{\gamma}{(1 - \gamma)^2}\Big(\sqrt{2 D_{\mathrm{KL}}^{\max}(\pi\|\pi')}\,\beta\epsilon_H + 2\epsilon_R\sqrt{2 D_{\mathrm{KL}}^{\max}(\mu\|\pi')}\Big),$$
where $\epsilon_H = \max_s |H(\pi'(\cdot|s))|$, $\epsilon_R = \max_{s,a} |A^\pi(s, a)|$, $D_{\mathrm{KL}}^{\max}(\pi\|\pi') = \max_s D_{\mathrm{KL}}(\pi(\cdot|s)\|\pi'(\cdot|s))$, and the equality holds when $\pi' = \pi$. We provide the proof in Appendix A.3.
Theorem 1 demonstrates that the surrogate function can approximate the objective function with a small error if the KL divergence is kept small enough. We then introduce a SAC-style surrogate by replacing the advantage in (8) with the Q-function as follows:
$$J^{\mu,\pi}(\pi') = \frac{1}{1 - \gamma}\,\mathbb{E}_{d^\mu, \pi'}\left[Q^\pi(s, a)\right] + \beta\,\mathbb{E}_{d^\pi}\left[H(\pi'(\cdot|s))\right] + C, \quad (10)$$
where $C$ is a constant term with respect to $\pi'$. The policy gradient can be calculated using the reparameterization trick, as done in SAC (Haarnoja et al., 2018). We present the training results on continuous RL tasks in Appendix B, where the entropy-regularized (8) and SAC-style (10) versions are compared. Although (8) and (10) are mathematically equivalent, it can be observed that the performance of the SAC-style version is superior to that of the regularized version. We can analyze this with two factors. First, if (8) is used in the off-policy setting, the importance ratios have significant variances, making training unstable. Second, the advantage function only gives scalar information about whether the sampled action is proper, whereas the Q-function directly gives the direction in which the action should be updated, so more information can be obtained from (10).

Safe Policy Update. Now, we can apply the same reformulation to the surrogate functions for the safety constraints, which are defined by Kim & Oh (2022a). The cost surrogate functions $F_k^{\mu,\pi}$ can be written in SAC-style form as follows:
$$J_{C_k}^{\mu,\pi}(\pi') := \mathbb{E}_{s \sim \rho}\left[V_{C_k}^\pi(s)\right] + \frac{1}{1 - \gamma}\,\mathbb{E}_{d^\mu, \pi'}\left[Q_{C_k}^\pi(s, a)\right] - \frac{1}{1 - \gamma}\,\mathbb{E}_{d^\mu}\left[V_{C_k}^\pi(s)\right],$$
$$J_{S_k}^{\mu,\pi}(\pi') := \mathbb{E}_{s \sim \rho}\left[S_{C_k}^\pi(s)\right] + \frac{1}{1 - \gamma^2}\,\mathbb{E}_{d_2^\mu, \pi'}\left[S_{C_k}^\pi(s, a)\right] - \frac{1}{1 - \gamma^2}\,\mathbb{E}_{d_2^\mu}\left[S_{C_k}^\pi(s)\right],$$
$$F_k^{\mu,\pi}(\pi'; \alpha) := J_{C_k}^{\mu,\pi}(\pi') + \frac{\phi(\Phi^{-1}(\alpha))}{\alpha}\sqrt{J_{S_k}^{\mu,\pi}(\pi') - (J_{C_k}^{\mu,\pi}(\pi'))^2}. \quad (11)$$
Remark that Kim & Oh (2022a) have shown that the cost surrogates are bounded in terms of the KL divergence between the current and next policies.
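The reparameterization trick used to differentiate the SAC-style surrogate can be illustrated on a one-dimensional Gaussian policy $a = \mu + \sigma\varepsilon$, $\varepsilon \sim \mathcal{N}(0, 1)$. The following is a self-contained Monte-Carlo sketch (function names are ours, not the paper's implementation): since $\partial a / \partial \mu = 1$, the gradient of $\mathbb{E}_{a}[Q(a)]$ with respect to $\mu$ is estimated by averaging $Q'(\mu + \sigma\varepsilon)$.

```python
import random

def reparam_grad_mu(q_grad, mu, sigma, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of d/d_mu E_{a ~ N(mu, sigma^2)}[Q(a)] via the
    reparameterization a = mu + sigma * eps, eps ~ N(0, 1):
        grad = E_eps[ Q'(a) * da/d_mu ] = E_eps[ Q'(mu + sigma * eps) ].
    q_grad is the derivative Q'(a) of the critic with respect to the action.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        total += q_grad(mu + sigma * eps)  # da/d_mu = 1
    return total / n_samples
```

For example, with $Q(a) = -a^2$ we have $\mathbb{E}[Q] = -(\mu^2 + \sigma^2)$, so the true gradient is $-2\mu$, which the estimator recovers; unlike a likelihood-ratio (score-function) estimator, no importance ratios appear.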
Thus, we can construct the following practical safe policy update rule by adding a trust region constraint:
$$\pi_{\mathrm{new}} = \operatorname*{argmax}_{\pi'}\, J^{\mu,\pi}(\pi') \;\;\text{s.t.}\;\; F_k^{\mu,\pi}(\pi'; \alpha) \leq d_k \;\;\forall k = 1, \dots, K, \;\; \bar{D}_{\mathrm{KL}}(\pi\|\pi') \leq \epsilon, \quad (12)$$
where $\bar{D}_{\mathrm{KL}}(\pi\|\pi') := \mathbb{E}_{s \sim d^\mu}\left[D_{\mathrm{KL}}(\pi(\cdot|s)\|\pi'(\cdot|s))\right]$, and $\epsilon$ is a trust region size. As (12) is nonlinear, the objective and constraints are approximated linearly, while the KL divergence is approximated quadratically, in order to determine the update direction. After the direction is obtained, a backtracking line search is performed. For more details, see Appendix A.5.

Approximations. In the distributional RL setting, the cost value and the cost square value functions can be approximated using the quantile distribution critics as follows:
$$Q_C^\pi(s, a) = \int_{-\infty}^{\infty} z \Pr(Z_C^\pi(s, a) = z)\,dz \approx \frac{1}{M}\sum_{m=1}^{M} \theta_m(s, a), \quad S_C^\pi(s, a) = \int_{-\infty}^{\infty} z^2 \Pr(Z_C^\pi(s, a) = z)\,dz \approx \frac{1}{M}\sum_{m=1}^{M} \theta_m(s, a)^2. \quad (13)$$
Finally, the proposed method is summarized in Algorithm 1: the policy is updated by solving (12), but if (12) has no solution, a recovery step is taken (Section 3.3).

3.3. FEASIBILITY HANDLING FOR MULTIPLE CONSTRAINTS

The proposed method updates a policy using (12), but the feasible set of (12) can be empty in the infeasible starting cases. To address the feasibility issue in safe RL with multiple constraints, one of the violated constraints can be selected, and the policy can be updated to minimize that constraint until the feasible region is no longer empty (Xu et al., 2021), which we call a naive approach. However, it may not be easy to quickly reach the feasible condition if only one constraint is used to update the policy at each update step. Therefore, we propose a gradient integration method that reflects all the constraints simultaneously. The main idea is to get a gradient that reduces the value of violated constraints and keeps unviolated constraints satisfied. To find such a gradient, the following quadratic program (QP) can be formulated by linearly approximating the constraints:
$$g^* = \operatorname*{argmin}_{g}\, \frac{1}{2} g^T H g \;\;\text{s.t.}\;\; g_k^T g + c_k \leq 0, \;\;\forall k \in \{1, \dots, K\}, \quad (14)$$
where $H$ is the Hessian of the KL divergence at the current policy parameters $\psi$, $g_k$ is the gradient of the $k$th cost surrogate, $c_k = \min\big(\sqrt{2\epsilon\, g_k^T H^{-1} g_k},\, F_k(\pi_\psi; \alpha) - d_k + \zeta\big)$, $\epsilon$ is a trust region size, and $\zeta \in \mathbb{R}_{>0}$ is a slack coefficient. Finally, we update the policy by $\psi^* = \psi + \min\big(1, \sqrt{2\epsilon/(g^{*T} H g^*)}\big)\, g^*$. Figure 2 illustrates the proposed gradient integration process. Each constraint is truncated by $c_k$ to be tangent to the trust region, and the slanted lines show the feasible region of the truncated constraints. The solution of (14) is indicated in red, pointing to the nearest point in the intersection of the constraints. If the solution crosses the trust region, the parameters are updated in the clipped direction, shown in blue. Then, the policy can reach the feasibility condition within finite time steps.

Theorem 2. Assume that the cost surrogates are differentiable and convex, the gradients of the surrogates are $L$-Lipschitz continuous, the eigenvalues of the Hessian are equal to or greater than a positive value $R \in \mathbb{R}_{>0}$, and $\{\psi \,|\, F_k(\pi_\psi; \alpha) + \zeta < d_k, \forall k\} \neq \emptyset$. Then, there exists $E \in \mathbb{R}_{>0}$ such that if $0 < \epsilon \leq E$ and the policy is updated by the proposed gradient integration method, all constraints are satisfied within finite time steps.

We provide the proof and show the existence of a solution of (14) in Appendix A.4. The provided proof shows that the constant $E$ is proportional to $\zeta$ and inversely proportional to the number of constraints $K$. This means that the trust region size should be set smaller as $K$ increases and $\zeta$ decreases. In conclusion, if the policy update rule (12) is not feasible, a finite number of applications of the proposed gradient integration method will make the policy feasible.
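For intuition, the QP (14) can be sketched in the simplified case $H = I$ (an assumption made here for illustration only), where it reduces to projecting the origin onto the intersection of the half-spaces $\{g : g_k^T g + c_k \leq 0\}$. The sketch below, with our own naming, solves this with Dykstra's projection algorithm and then clips the result to the trust region; the actual method uses the KL-divergence Hessian and a generic QP solver.

```python
def gradient_integration(grads, cs, eps, max_iter=500):
    """Recovery direction g* = argmin_g 0.5 * g^T g  s.t.  g_k^T g + c_k <= 0,
    i.e. the QP with H = I, solved by Dykstra's projection algorithm onto the
    intersection of half-spaces (assumes each g_k is nonzero). The result is
    clipped to the trust region ||g|| <= sqrt(2 * eps)."""
    dim = len(grads[0])
    g = [0.0] * dim
    incs = [[0.0] * dim for _ in grads]  # Dykstra correction terms
    for _ in range(max_iter):
        for k, (gk, ck) in enumerate(zip(grads, cs)):
            y = [g[d] + incs[k][d] for d in range(dim)]
            viol = sum(a * b for a, b in zip(gk, y)) + ck
            nrm2 = sum(a * a for a in gk)
            step = max(viol, 0.0) / nrm2  # project y onto {x : gk.x + ck <= 0}
            proj = [y[d] - step * gk[d] for d in range(dim)]
            incs[k] = [y[d] - proj[d] for d in range(dim)]
            g = proj
    norm = sum(v * v for v in g) ** 0.5
    scale = min(1.0, (2 * eps) ** 0.5 / norm) if norm > 0 else 1.0
    return [scale * v for v in g]
```

For two half-spaces $g_1 = (1, 0)$, $g_2 = (0, 1)$ with $c_1 = c_2 = 1$, the nearest feasible point to the origin is $(-1, -1)$; a small trust region clips this direction without changing it, as in the blue arrow of Figure 2.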

4. RELATED WORK

Safe Reinforcement Learning. There are various safe RL methods depending on how policies are updated to reflect safety constraints. First, trust region-based methods (Achiam et al., 2017; Yang et al., 2020; Kim & Oh, 2022a) find policy update directions by approximating the safe RL problem and update policies through a line search. Second, Lagrangian-based methods (Stooke et al., 2020; Yang et al., 2021; Liu et al., 2020) convert the safe RL problem to a dual problem and update the policy and dual variables simultaneously. Last, expectation-maximization (EM) based methods (Liu et al., 2022) find non-parametric policy distributions by solving the safe RL problem in E-steps and fit parametric policies to the found non-parametric distributions in M-steps. There are also ways to reflect safety other than policy updates. Qin et al. (2021); Lee et al. (2022) find optimal state or state-action distributions that satisfy constraints, and Bharadhwaj et al. (2021); Thananjeyan et al. (2021) reflect safety during exploration by executing only safe action candidates. In the experiments, only safe RL methods of the policy update approach are compared with the proposed method.

Distributional TD(λ). TD(λ) (Precup et al., 2000) can be extended to distributional critics to trade off bias and variance. Gruslys et al. (2018) have proposed a method to obtain target distributions by mixing n-step distributions, but the method is applicable only in discrete action spaces. Nam et al. (2021) have proposed a method to obtain target distributions using sampling so as to apply to continuous action spaces, but this is only for on-policy settings. A method proposed by Tang et al. (2022) updates the critics using newly defined distributional TD errors rather than target distributions. This method is applicable to off-policy settings but has the disadvantage that memory usage increases linearly with the number of TD error steps.
In contrast to these methods, the proposed method is memory-efficient and applicable to continuous action spaces under off-policy settings.

Gradient Integration. The proposed feasibility handling method utilizes a gradient integration method, which is widely used in multi-task learning (MTL). The gradient integration method finds a single gradient that improves all tasks by using the gradients of all tasks. Yu et al. (2020) have proposed a projection-based gradient integration method, which is guaranteed to converge to Pareto-stationary sets. A method proposed by Liu et al. (2021) can reflect user preference, and Navon et al. (2022) have proposed a gradient-scale-invariant method to prevent the training process from being biased toward a few tasks. The proposed method can be viewed as a mixture of the projection and scale-invariant methods, as gradients are clipped and projected onto a trust region.

5. EXPERIMENTS

We evaluate the safety performance of the proposed method and answer whether safe RL actually has the benefit of reducing the effort of reward engineering. For evaluation, agents are trained in the Safety Gym (Ray et al., 2019) with several tasks and robots. To check the advantage of safe RL, we construct locomotion tasks using legged robots with different models and different numbers of legs.

5.1. SAFETY GYM

Tasks. We employ two robots, point and car, to perform the goal and button tasks in the Safety Gym. The goal task is to control a robot toward a randomly spawned goal without passing through hazard regions. The button task is to click a randomly designated button using a robot, where not only hazard regions but also dynamic obstacles exist. Agents receive a cost when touching undesignated buttons or obstacles, or when entering hazard regions. There is only one constraint for all tasks, and it is defined using (3) with the sum of costs. Constraint violations (CVs) are counted when a robot contacts obstacles or undesignated buttons, or passes through hazard regions.

Baselines. Safe RL methods based on various types of policy updates are used as baselines. For trust region-based methods, we use constrained policy optimization (CPO) (Achiam et al., 2017) and off-policy trust-region CVaR (OffTRC) (Kim & Oh, 2022a), which extends CPO to an off-policy, mean-std constrained version. For the Lagrangian-based method, worst-case soft actor-critic (WCSAC) (Yang et al., 2021) is used, and constrained variational policy optimization (CVPO) (Liu et al., 2022), based on the EM method, is also used. Specifically, WCSAC, OffTRC, and the proposed method, SDAC, use the mean-std constraints, so we experiment with α = 0.25, 0.5, and 1.0 (when α = 1.0, the constraint is identical to the mean constraint).

Results. The graph of the final score and the total number of CVs is shown in Figure 3, and the training curves are provided in Appendix D.1. If points are located in the upper left corner of the graph, the result can be interpreted as excellent since the score is high and the number of CVs is low. The frontiers of SDAC, indicated by the blue dashed lines in Figure 3, are located in the upper left corners for all tasks. Hence, SDAC shows outstanding safety performance compared to the other methods.
In particular, SDAC with α = 0.25 shows comparable scores despite recording the lowest number of CVs in all tasks. Although its frontier overlaps with that of WCSAC in the car goal and point button tasks, WCSAC shows a high fluctuation of scores depending on the value of α. In addition, it can be seen that the proposed method enhances the efficiency and training stability of the trust region method, since SDAC shows high performance and small variance compared to the other trust region-based methods, OffTRC and CPO.

5.2. LOCOMOTION TASKS

Tasks. The locomotion tasks are to train robots to follow xy-directional linear and z-directional angular velocity commands. Mini-Cheetah from MIT (Katz et al., 2019) and Laikago from Unitree (Wang, 2018) are used as quadrupedal robots, and Cassie from Agility Robotics (Xie et al., 2018) is used as a bipedal robot. In order to successfully perform the locomotion tasks, robots should keep balancing, standing, and stamping their feet so that they can move in any direction. Therefore, we define three constraints. The first constraint, for balancing, is to keep the body angle from deviating from zero, and the second, for standing, is to keep the height of the CoM above a threshold. The final constraint is to match the current foot contact state with a predefined foot contact timing. In particular, the contact timing is defined as stepping off the left and right feet symmetrically. The reward is defined as the negative $l_2$-norm of the difference between the command and the current velocity. For more details, see Appendix C.

Baselines. Through these tasks, we check the advantage of safe RL over traditional RL. Proximal policy optimization (PPO) (Schulman et al., 2017), based on the trust region method, and truncated quantile critics (TQC) (Kuznetsov et al., 2020), based on SAC, are used as traditional RL baselines. To apply the same experiment to traditional RL, it is necessary to design a reward reflecting safety. We construct the reward through a weighted sum as $\bar{R} = (R - \sum_{i=1}^{3} w_i C_i)/(1 + \sum_{i=1}^{3} w_i)$, where $R$ and $C_{\{1,2,3\}}$ are used to train the safe RL methods and are defined in Appendix C, and $R$ is called the true reward. The optimal weights are searched by a Bayesian optimization tool, which optimizes the true reward of PPO for the Mini-Cheetah task. The same weights are used for all robots and baselines to verify whether reward engineering is required individually for each robot.

Results.
Figure 4 shows the true reward sum graphs according to the x-directional velocity command. The overall training curves are presented in Appendix D.2, and demonstration videos are attached in the supplementary material. The figure shows that SDAC performs the locomotion tasks successfully, as the reward sums of all tasks are almost zero. PPO shows comparable results on Mini-Cheetah and Laikago since the reward of the traditional RL baselines is optimized for the Mini-Cheetah task of PPO. However, the reward sum is significantly reduced on the Cassie task, whose kinematic model largely differs from those of the other robots. TQC shows the lowest reward sums despite being a state-of-the-art algorithm on other RL benchmarks (Kuznetsov et al., 2020). From these results, it can be observed that reward engineering is required for each algorithm and robot.
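For reference, the weighted-sum reward used for the traditional RL baselines can be sketched as follows (a trivial helper of our own naming; the weights themselves come from the Bayesian optimization described above):

```python
def shaped_reward(true_reward, costs, weights):
    """Weighted-sum reward for the traditional RL baselines:
        R_bar = (R - sum_i w_i * C_i) / (1 + sum_i w_i),
    folding the three safety costs into a single scalar reward."""
    penalty = sum(w * c for w, c in zip(weights, costs))
    return (true_reward - penalty) / (1.0 + sum(weights))
```

The normalization by $1 + \sum_i w_i$ keeps the shaped reward on a scale comparable to the true reward, but the weights $w_i$ still have to be re-tuned per robot, which is exactly the reward engineering that the safe RL formulation avoids.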

5.3. ABLATION STUDY

We conduct ablation studies to show whether the proposed target distribution lowers the estimation bias and whether the proposed gradient integration quickly converges to feasibility. In Figure 5a, the number of CVs decreases as λ increases, which means that the bias of the constraint estimation decreases. However, the score also decreases due to the larger variance, showing that λ adjusts the bias-variance tradeoff. In Figure 5b, the proposed gradient integration method is compared with a naive approach, described in Section 3.3, which minimizes the constraints in order from the first to the third. The proposed method reaches feasibility faster than the naive approach and shows stable training curves because it reflects all constraints simultaneously.

6. CONCLUSION

We have presented a trust region-based safe distributional RL method, called SDAC. To maximize the merit of the trust region method, which can consistently satisfy constraints, we improve performance by using the Q-function instead of the advantage function in the policy update. We have also proposed a memory-efficient, practical method for finding low-biased target distributions in off-policy settings to estimate constraints. Finally, we have proposed a method for handling multiple constraints that resolves the feasibility issue arising in the trust region method. Through extensive experiments, we have demonstrated that SDAC with mean-std constraints achieves improved performance with minimal constraint violations and successfully performs the locomotion tasks without reward engineering.

A ALGORITHM DETAILS

A.1 CONVERGENCE ANALYSIS

In this section, we show that the proposed TD($\lambda$) target distribution converges to $Z^\pi$. First, we express the target distribution using a distributional operator and show that the operator is contractive. Finally, we show that $Z^\pi$ is the unique fixed point. Before starting the proof, we introduce useful notions, distance metrics, and operators. As the return $Z^\pi(s,a)$ is a random variable, we define the distribution of $Z^\pi(s,a)$ as $\nu^\pi(s,a)$. Let $\eta$ be the distribution of a random variable $X$. Then, we can express the distribution of an affine transformation of the random variable, $aX+b$, using the pushforward operator, defined by Rowland et al. (2018), as $(f_{a,b})_\#(\eta)$. To measure the distance between two distributions, Bellemare et al. (2023) have defined the distance $l_p$ as follows:

$$l_p(\eta_1, \eta_2) := \left( \int_{\mathbb{R}} |F_{\eta_1}(x) - F_{\eta_2}(x)|^p \, dx \right)^{1/p},$$

where $F_\eta(x)$ is the cumulative distribution function. This distance is $1/p$-homogeneous, regular, and $p$-convex (see Section 4 of Bellemare et al. (2023) for more details). For functions that map state-action pairs to distributions, a distance can be defined as (Bellemare et al., 2023):

$$\bar{l}_p(\nu_1, \nu_2) := \sup_{(s,a) \in S \times A} l_p(\nu_1(s,a), \nu_2(s,a)).$$

Then, the proposed TD($\lambda$) target distribution can be expressed as an operator as below:

$$\mathcal{T}^{\mu,\pi}_\lambda \nu(s,a) := \frac{1-\lambda}{N} \sum_{i=0}^{\infty} \lambda^i \, \mathbb{E}_\mu\left[ \left( \prod_{j=1}^{i} \eta(s_j, a_j) \right) \mathbb{E}_{a' \sim \pi(\cdot|s_{i+1})}\left[ \left(f_{\gamma^{i+1}, \sum_{t=0}^{i} \gamma^t r_t}\right)_\# \left(\nu(s_{i+1}, a')\right) \right] \,\middle|\, s_0 = s, a_0 = a \right],$$

where $\eta(s,a) = \pi(a|s)/\mu(a|s)$. Then, the operator $\mathcal{T}^{\mu,\pi}_\lambda$ has a contraction property.

Theorem 3. Under the distance $\bar{l}_p$ and the assumption that the state, action, and reward spaces are finite, $\mathcal{T}^{\mu,\pi}_\lambda$ is $\gamma^{1/p}$-contractive.

Proof. First, the operator can be rewritten using summation as follows:
$$\begin{aligned}
\mathcal{T}^{\mu,\pi}_\lambda \nu(s,a) &= \frac{1-\lambda}{N} \sum_{i=0}^{\infty} \lambda^i \sum_{a' \in A} \sum_{(s_0, a_0, r_0, \ldots, s_{i+1})} \Pr_\mu(\underbrace{s_0, a_0, r_0, \ldots, s_{i+1}}_{=:\tau}) \left( \prod_{j=1}^{i} \eta(s_j, a_j) \right) \pi(a'|s_{i+1}) \left(f_{\gamma^{i+1}, \sum_{t=0}^{i} \gamma^t r_t}\right)_\# (\nu(s_{i+1}, a')) \\
&= \frac{1-\lambda}{N} \sum_{i=0}^{\infty} \lambda^i \sum_{a' \in A} \sum_{\tau} \Pr_\mu(\tau) \left( \prod_{j=1}^{i} \eta(s_j, a_j) \right) \pi(a'|s_{i+1}) \sum_{s' \in S} \mathbb{1}_{s' = s_{i+1}} \sum_{r'_{0:i}} \prod_{k=0}^{i} \mathbb{1}_{r'_k = r_k} \left(f_{\gamma^{i+1}, \sum_{t=0}^{i} \gamma^t r'_t}\right)_\# (\nu(s', a')) \\
&= \frac{1-\lambda}{N} \sum_{i=0}^{\infty} \lambda^i \sum_{a' \in A} \sum_{s' \in S} \sum_{r'_{0:i}} \left(f_{\gamma^{i+1}, \sum_{t=0}^{i} \gamma^t r'_t}\right)_\# (\nu(s', a')) \underbrace{\mathbb{E}_\mu\left[ \left( \prod_{j=1}^{i} \eta(s_j, a_j) \right) \pi(a'|s_{i+1}) \mathbb{1}_{s' = s_{i+1}} \prod_{k=0}^{i} \mathbb{1}_{r'_k = r_k} \right]}_{=: w_{s',a',r'_{0:i}}} \\
&= \frac{1-\lambda}{N} \sum_{i=0}^{\infty} \sum_{s' \in S} \sum_{a' \in A} \sum_{r'_{0:i}} \lambda^i w_{s',a',r'_{0:i}} \left(f_{\gamma^{i+1}, \sum_{t=0}^{i} \gamma^t r'_t}\right)_\# (\nu(s', a')). \quad (17)
\end{aligned}$$

Since the sum of the weights of the distributions should be one, we can find the normalization factor $N = (1-\lambda) \sum_{i=0}^{\infty} \sum_{s \in S} \sum_{a \in A} \sum_{r_{0:i}} \lambda^i w_{s,a,r_{0:i}}$. Then, the following inequality can be derived using the homogeneity, regularity, and convexity of $l_p$:

$$\begin{aligned}
l_p^p\left(\mathcal{T}^{\mu,\pi}_\lambda \nu_1(s,a), \mathcal{T}^{\mu,\pi}_\lambda \nu_2(s,a)\right) &= l_p^p\left( \frac{1-\lambda}{N} \sum_{i,s,a,r_{0:i}} \lambda^i w_{s,a,r_{0:i}} \left(f_{\gamma^{i+1}, \sum_t \gamma^t r_t}\right)_\# (\nu_1(s,a)), \; \frac{1-\lambda}{N} \sum_{i,s,a,r_{0:i}} \lambda^i w_{s,a,r_{0:i}} \left(f_{\gamma^{i+1}, \sum_t \gamma^t r_t}\right)_\# (\nu_2(s,a)) \right) \\
&\le \sum_{i,s,a,r_{0:i}} \frac{(1-\lambda)\lambda^i w_{s,a,r_{0:i}}}{N} \, l_p^p\left( \left(f_{\gamma^{i+1}, \sum_t \gamma^t r_t}\right)_\# (\nu_1(s,a)), \left(f_{\gamma^{i+1}, \sum_t \gamma^t r_t}\right)_\# (\nu_2(s,a)) \right) \quad (\because p\text{-convexity}) \\
&\le \sum_{i,s,a,r_{0:i}} \frac{(1-\lambda)\lambda^i w_{s,a,r_{0:i}}}{N} \, l_p^p\left( (f_{\gamma^{i+1}, 0})_\# (\nu_1(s,a)), (f_{\gamma^{i+1}, 0})_\# (\nu_2(s,a)) \right) \quad (\because \text{regularity}) \\
&= \sum_{i,s,a,r_{0:i}} \frac{(1-\lambda)\lambda^i w_{s,a,r_{0:i}}}{N} \, \gamma^{i+1} \, l_p^p(\nu_1(s,a), \nu_2(s,a)) \quad (\because 1/p\text{-homogeneity}) \\
&\le \sum_{i,s,a,r_{0:i}} \frac{(1-\lambda)\lambda^i w_{s,a,r_{0:i}}}{N} \, \gamma^{i+1} \, \bar{l}_p(\nu_1, \nu_2)^p \;\le\; \gamma \, \bar{l}_p(\nu_1, \nu_2)^p. \quad (18)
\end{aligned}$$

Therefore, $\bar{l}_p(\mathcal{T}^{\mu,\pi}_\lambda \nu_1, \mathcal{T}^{\mu,\pi}_\lambda \nu_2) \le \gamma^{1/p} \bar{l}_p(\nu_1, \nu_2)$. By Banach's fixed point theorem, the operator has a unique fixed distribution. From the definition of $Z^\pi$, the following equality holds (Rowland et al., 2018): $\nu^\pi(s,a) = \mathbb{E}_\pi\left[(f_{\gamma, r})_\#(\nu^\pi(s', a'))\right]$.
Then, it can be shown that $\nu^\pi$ is the fixed distribution by applying the operator $\mathcal{T}^{\mu,\pi}_\lambda$ to $\nu^\pi$:

$$\begin{aligned}
\mathcal{T}^{\mu,\pi}_\lambda \nu^\pi(s,a) &= \frac{1-\lambda}{N} \sum_{i=0}^{\infty} \lambda^i \, \mathbb{E}_\mu\left[ \left( \prod_{j=1}^{i} \eta(s_j, a_j) \right) \mathbb{E}_{a' \sim \pi(\cdot|s_{i+1})}\left[ \left(f_{\gamma^{i+1}, \sum_{t=0}^{i} \gamma^t r_t}\right)_\# (\nu^\pi(s_{i+1}, a')) \right] \,\middle|\, s_0 = s, a_0 = a \right] \\
&= \frac{1-\lambda}{N} \sum_{i=0}^{\infty} \lambda^i \, \mathbb{E}_\pi\left[ \left(f_{\gamma^{i+1}, \sum_{t=0}^{i} \gamma^t r_t}\right)_\# (\nu^\pi(s_{i+1}, a_{i+1})) \,\middle|\, s_0 = s, a_0 = a \right] \\
&= \frac{1-\lambda}{N} \sum_{i=0}^{\infty} \lambda^i \nu^\pi(s,a) = \nu^\pi(s,a).
\end{aligned}$$

A.2 PSEUDOCODE OF TD(λ) TARGET DISTRIBUTION

We provide the pseudocode for calculating the TD($\lambda$) target distribution for the reward critic in Algorithm 2. The target distribution for the cost critics can be obtained by simply replacing the reward with the cost.

Algorithm 2: TD(λ) Target Distribution
Data: Policy network $\pi_\psi$, critic network $Z^\pi_\theta$, and trajectory $\{(s_t, a_t, \mu(a_t|s_t), r_t, d_t, s_{t+1})\}_{t=1}^{T}$.
Sample an action $a'_{T+1} \sim \pi_\psi(s_{T+1})$ and get $\hat{Z}^{(\mathrm{tot})}_T = r_T + (1 - d_T)\gamma Z^\pi_\theta(s_{T+1}, a'_{T+1})$.
Initialize the total weight $w^{\mathrm{tot}} = \lambda$.
for $t = T, \ldots, 1$ do
    Sample an action $a'_{t+1} \sim \pi_\psi(s_{t+1})$ and get $\hat{Z}^{(1)}_t = r_t + (1 - d_t)\gamma Z^\pi_\theta(s_{t+1}, a'_{t+1})$.
    Set the current weight $w = 1 - \lambda$.
    Combine the two targets, $(\hat{Z}^{(1)}_t, w)$ and $(\hat{Z}^{(\mathrm{tot})}_t, w^{\mathrm{tot}})$, and sort the combined target according to the positions of the atoms.
    Build the CDF of the combined target by accumulating the weights at each atom.
    Project the combined target onto a quantile distribution with $M'$ atoms, denoted $\hat{Z}^{(\mathrm{proj})}_t$, using the CDF (find the atom positions corresponding to each quantile).
    Update $\hat{Z}^{(\mathrm{tot})}_{t-1} = r_{t-1} + (1 - d_{t-1})\gamma \hat{Z}^{(\mathrm{proj})}_t$ and $w^{\mathrm{tot}} = \lambda \frac{\pi_\psi(a_t|s_t)}{\mu(a_t|s_t)}(1 - d_{t-1})(1 - \lambda + w^{\mathrm{tot}})$.
end
Return $\{\hat{Z}^{(\mathrm{proj})}_t\}_{t=1}^{T}$.
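A minimal NumPy sketch of the core steps of Algorithm 2 is given below. The function names, the uniform per-atom weights within each quantile set, and the array-based trajectory interface are our own assumptions, not the paper's implementation.

```python
import numpy as np

def project_to_quantiles(atoms, weights, m_prime):
    """Project a weighted mixture of atoms onto an m_prime-atom quantile
    distribution: sort atoms, accumulate (normalized) weights into a CDF,
    and pick the atom at each quantile midpoint (i + 0.5) / m_prime."""
    order = np.argsort(atoms)
    atoms, weights = atoms[order], weights[order]
    cdf = np.cumsum(weights) / np.sum(weights)
    taus = (np.arange(m_prime) + 0.5) / m_prime
    idx = np.clip(np.searchsorted(cdf, taus), 0, len(atoms) - 1)
    return atoms[idx]

def td_lambda_targets(rewards, dones, boot_atoms, ratios, lam, gamma, m_prime):
    """Backward recursion of Algorithm 2 (sketch).  boot_atoms[t] holds the
    quantile atoms of the one-step target r_t + (1 - d_t) * gamma *
    Z(s_{t+1}, a'_{t+1}); ratios[t] = pi(a_t|s_t) / mu(a_t|s_t)."""
    T = len(rewards)
    targets = [None] * T
    z_tot = boot_atoms[T - 1]          # \hat{Z}^{(tot)}_T
    w_tot = lam
    for t in range(T - 1, -1, -1):
        z_one = boot_atoms[t]          # one-step target \hat{Z}^{(1)}_t
        atoms = np.concatenate([z_one, z_tot])
        weights = np.concatenate([     # weight (1 - lam) vs w_tot, split per atom
            np.full(len(z_one), (1 - lam) / len(z_one)),
            np.full(len(z_tot), w_tot / len(z_tot)),
        ])
        z_proj = project_to_quantiles(atoms, weights, m_prime)
        targets[t] = z_proj
        if t > 0:
            z_tot = rewards[t - 1] + (1 - dones[t - 1]) * gamma * z_proj
            w_tot = lam * ratios[t] * (1 - dones[t - 1]) * (1 - lam + w_tot)
    return targets
```

The projection keeps the memory footprint constant across the backward pass, since the combined target is compressed back to $M'$ atoms at every step.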

A.3 PROOF OF THEOREM 1

Before showing the proof, we present a new function and a lemma. A value difference function is defined as follows:

$$\delta_{\pi'}(s) := \mathbb{E}\left[ R(s,a,s') + \gamma V^\pi(s') - V^\pi(s) \mid a \sim \pi'(\cdot|s), s' \sim P(\cdot|s,a) \right] = \mathbb{E}_{a \sim \pi'}\left[ A^\pi(s,a) \right].$$

Lemma 4. The maximum of $|\delta_{\pi'}(s) - \delta_\pi(s)|$ is equal to or less than $\epsilon_R \sqrt{2 D^{\max}_{\mathrm{KL}}(\pi||\pi')}$.

Proof. The value difference can be expressed in a vector form:

$$\delta_{\pi'}(s) - \delta_\pi(s) = \sum_a (\pi'(a|s) - \pi(a|s)) A^\pi(s,a) = \langle \pi'(\cdot|s) - \pi(\cdot|s), A^\pi(s,\cdot) \rangle.$$

Using Hölder's inequality, the following inequality holds:

$$|\delta_{\pi'}(s) - \delta_\pi(s)| \le \|\pi'(\cdot|s) - \pi(\cdot|s)\|_1 \cdot \|A^\pi(s,\cdot)\|_\infty = 2 D_{\mathrm{TV}}(\pi'(\cdot|s)||\pi(\cdot|s)) \max_a |A^\pi(s,a)|$$
$$\Rightarrow \|\delta_{\pi'} - \delta_\pi\|_\infty = \max_s |\delta_{\pi'}(s) - \delta_\pi(s)| \le 2 \epsilon_R \max_s D_{\mathrm{TV}}(\pi(\cdot|s)||\pi'(\cdot|s)).$$

Using Pinsker's inequality, $\|\delta_{\pi'} - \delta_\pi\|_\infty \le \epsilon_R \sqrt{2 D^{\max}_{\mathrm{KL}}(\pi||\pi')}$.

Theorem 1. Let us assume that $\max_s H(\pi(\cdot|s)) < \infty$ for $\forall \pi \in \Pi$. The difference between the objective and surrogate functions is bounded by a term consisting of KL divergence as:

$$J(\pi') - J_{\mu,\pi}(\pi') \le \frac{\gamma}{(1-\gamma)^2} \sqrt{D^{\max}_{\mathrm{KL}}(\pi||\pi')} \left( \sqrt{2}\,\beta \epsilon_H + 2 \epsilon_R \sqrt{D^{\max}_{\mathrm{KL}}(\mu||\pi')} \right),$$

where $\epsilon_H = \max_s |H(\pi'(\cdot|s))|$, $\epsilon_R = \max_{s,a} |A^\pi(s,a)|$, $D^{\max}_{\mathrm{KL}}(\pi||\pi') = \max_s D_{\mathrm{KL}}(\pi(\cdot|s)||\pi'(\cdot|s))$, and the equality holds when $\pi' = \pi$.

Proof. The surrogate function can be expressed in vector form as follows:

$$J_{\mu,\pi}(\pi') = \langle \rho, V^\pi \rangle + \frac{1}{1-\gamma}\left( \langle d^\mu, \delta_{\pi'} \rangle + \beta \langle d^\pi, H_{\pi'} \rangle \right),$$

where $H_{\pi'}(s) = H(\pi'(\cdot|s))$. The objective function of $\pi'$ can also be expressed in vector form using Lemma 1 from Achiam et al. (2017):

$$J(\pi') = \frac{1}{1-\gamma} \mathbb{E}\left[ R(s,a,s') + \beta H_{\pi'}(s) \mid s \sim d^{\pi'}, a \sim \pi'(\cdot|s), s' \sim P(\cdot|s,a) \right] = \frac{1}{1-\gamma} \mathbb{E}_{s \sim d^{\pi'}}\left[ \delta_{\pi'}(s) + \beta H_{\pi'}(s) \right] + \mathbb{E}_{s \sim \rho}\left[ V^\pi(s) \right] = \langle \rho, V^\pi \rangle + \frac{1}{1-\gamma} \langle d^{\pi'}, \delta_{\pi'} + \beta H_{\pi'} \rangle.$$

By Lemma 3 from Achiam et al. (2017), $\|d^\pi - d^{\pi'}\|_1 \le \frac{\gamma}{1-\gamma} \sqrt{2 D^{\max}_{\mathrm{KL}}(\pi||\pi')}$.
Then, the following inequality is satisfied:

$$\begin{aligned}
|(1-\gamma)(J_{\mu,\pi}(\pi') - J(\pi'))| &= |\langle d^{\pi'} - d^\mu, \delta_{\pi'} \rangle + \beta \langle d^\pi - d^{\pi'}, H_{\pi'} \rangle| \\
&\le |\langle d^{\pi'} - d^\mu, \delta_{\pi'} \rangle| + \beta |\langle d^\pi - d^{\pi'}, H_{\pi'} \rangle| \\
&= |\langle d^{\pi'} - d^\mu, \delta_{\pi'} - \delta_\pi \rangle| + \beta |\langle d^\pi - d^{\pi'}, H_{\pi'} \rangle| \quad (\because \delta_\pi = 0) \\
&\le \|d^{\pi'} - d^\mu\|_1 \|\delta_{\pi'} - \delta_\pi\|_\infty + \beta \|d^\pi - d^{\pi'}\|_1 \|H_{\pi'}\|_\infty \quad (\because \text{Hölder's inequality}) \\
&\le 2 \epsilon_R \frac{\gamma}{1-\gamma} \sqrt{D^{\max}_{\mathrm{KL}}(\mu||\pi') D^{\max}_{\mathrm{KL}}(\pi||\pi')} + \frac{\beta \gamma \epsilon_H}{1-\gamma} \sqrt{2 D^{\max}_{\mathrm{KL}}(\pi||\pi')} \quad (\because \text{Lemma 4}) \\
&= \frac{\gamma}{1-\gamma} \sqrt{D^{\max}_{\mathrm{KL}}(\pi||\pi')} \left( \sqrt{2}\,\beta \epsilon_H + 2 \epsilon_R \sqrt{D^{\max}_{\mathrm{KL}}(\mu||\pi')} \right).
\end{aligned}$$

If $\pi' = \pi$, the KL divergence term becomes zero, so equality holds.
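As a quick numerical sanity check of Lemma 4 (our own illustration, not part of the paper), the Hölder and Pinsker steps can be verified on random finite distributions:

```python
import numpy as np

# Random policies over 5 actions and a centered "advantage" vector,
# so that delta_pi(s) = <pi, A> = 0 as in the proof.
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(5))
pi_new = rng.dirichlet(np.ones(5))
A = rng.normal(size=5)
A = A - pi @ A                       # enforce E_{a~pi}[A^pi(s,a)] = 0

lhs = abs((pi_new - pi) @ A)         # |delta_{pi'}(s) - delta_pi(s)|
d_tv = 0.5 * np.abs(pi_new - pi).sum()
d_kl = float(np.sum(pi * np.log(pi / pi_new)))
eps_R = np.abs(A).max()

assert lhs <= 2.0 * d_tv * eps_R + 1e-12                           # Hölder step
assert 2.0 * d_tv * eps_R <= eps_R * np.sqrt(2.0 * d_kl) + 1e-12   # Pinsker step
```

Both assertions hold for any sample, since Pinsker's inequality guarantees $D_{\mathrm{TV}} \le \sqrt{D_{\mathrm{KL}}/2}$.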

A.4 PROOF OF THEOREM 2

We denote the policy parameter space as $\Psi \subseteq \mathbb{R}^d$, the parameter at the $t$th iteration as $\psi_t \in \Psi$, the Hessian matrix as $H(\psi_t) = \nabla^2_\psi D_{\mathrm{KL}}(\pi_{\psi_t}||\pi_\psi)|_{\psi=\psi_t}$, and the $k$th cost surrogate as $F_k(\psi_t) = F^k_{\mu,\pi}(\pi_{\psi_t}; \alpha)$. As we focus on the $t$th iteration, the following notations are used for brevity: $H = H(\psi_t)$ and $g_k = \nabla F_k(\psi_t)$. The proposed gradient integration at the $t$th iteration is defined as the following quadratic program (QP):

$$g^*_t = \underset{g}{\mathrm{argmin}} \; \frac{1}{2} g^T H g \quad \text{s.t.} \;\; g_k^T g + c_k \le 0 \;\; \text{for } \forall k, \quad (20)$$

where $c_k = \min\left( \sqrt{2\epsilon \, g_k^T H^{-1} g_k}, \; F_k(\psi_t) - d_k + \zeta \right)$. In the remainder of this section, we introduce the assumptions and new definitions, discuss the existence of a solution of (20), show the convergence to the feasibility condition for varying step size cases, and provide the proof of Theorem 2.

Assumption. 1) Each $F_k$ is differentiable and convex, 2) $\nabla F_k$ is $L$-Lipschitz continuous, 3) all eigenvalues of the Hessian matrix $H(\psi)$ are equal to or greater than $R \in \mathbb{R}_{>0}$ for $\forall \psi \in \Psi$, and 4) $\{\psi \,|\, F_k(\psi) + \zeta < d_k \; \text{for } \forall k\} \neq \emptyset$.

Definition. Using the Cholesky decomposition, the Hessian matrix can be expressed as $H = B B^T$, where $B$ is a lower triangular matrix. By introducing new terms, $\bar{g}_k := B^{-1} g_k$ and $b_t := B^T g^*_t$, the following is satisfied: $g_k^T H^{-1} g_k = \|\bar{g}_k\|_2^2$. Additionally, we define the in-boundary and out-boundary sets as:

$$\mathrm{IB}_k := \left\{ \psi \,\middle|\, F_k(\psi) - d_k + \zeta \le \sqrt{2\epsilon \, \nabla F_k(\psi)^T H^{-1}(\psi) \nabla F_k(\psi)} \right\}, \quad \mathrm{OB}_k := \left\{ \psi \,\middle|\, F_k(\psi) - d_k + \zeta \ge \sqrt{2\epsilon \, \nabla F_k(\psi)^T H^{-1}(\psi) \nabla F_k(\psi)} \right\}.$$

The minimum of $\|\bar{g}_k\|$ in $\mathrm{OB}_k$ is denoted as $m_k$, and the maximum of $\|\bar{g}_k\|$ in $\mathrm{IB}_k$ is denoted as $M_k$. Also, $\min_k m_k$ and $\max_k M_k$ are denoted as $m$ and $M$, respectively, and we can say that $m$ is positive.

Lemma 5. For all $k$, the minimum value $m_k$ is positive.

Proof. Assume that there exists $k \in \{1, \ldots, K\}$ such that $m_k$ is equal to zero at a policy parameter $\psi^* \in \mathrm{OB}_k$, i.e., $\|\nabla F_k(\psi^*)\| = 0$. Since $F_k$ is convex, $\psi^*$ is a minimum point of $F_k$, so the fourth assumption gives $\min_\psi F_k(\psi) = F_k(\psi^*) < d_k - \zeta$.
However, $F_k(\psi^*) \ge d_k - \zeta$ as $\psi^* \in \mathrm{OB}_k$, which is a contradiction, so $m_k$ is positive. Hence, the minimum of the $m_k$ is also positive.

Lemma 6. A solution of (20) always exists.

Proof. There exists a policy parameter $\hat{\psi} \in \{\psi \,|\, F_k(\psi) + \zeta < d_k \; \text{for } \forall k\}$ due to the assumptions. Let $\hat{g} = \hat{\psi} - \psi_t$. Then, the following inequality holds:

$$g_k^T(\hat{\psi} - \psi_t) + c_k \le g_k^T(\hat{\psi} - \psi_t) + F_k(\psi_t) + \zeta - d_k \le F_k(\hat{\psi}) + \zeta - d_k \quad (\because F_k \text{ is convex})$$
$$\Rightarrow g_k^T(\hat{\psi} - \psi_t) + c_k \le F_k(\hat{\psi}) + \zeta - d_k < 0 \quad \text{for } \forall k.$$

Since $\hat{\psi} - \psi_t$ satisfies all constraints of (20), the feasible set is non-empty and convex. Also, $H$ is positive definite, so the QP has a unique solution.

Lemma 6 shows the existence of a solution of (20). Now, we show the convergence of the proposed gradient integration method in the case of varying step sizes.

Lemma 7. If $\sqrt{2\epsilon}\, M \le \zeta$ and a policy is updated by $\psi_{t+1} = \psi_t + \beta_t g^*_t$, where $0 < \beta_t < \frac{2\sqrt{2\epsilon}\, m R}{L \|b_t\|^2}$ and $\beta_t \le 1$, the policy satisfies $F_k(\psi) \le d_k$ for $\forall k$ within a finite time.

Proof. We can reformulate the step size as $\beta_t = \frac{2\sqrt{2\epsilon}\, m R}{L \|b_t\|^2} \beta'_t$, where $\beta'_t \le \frac{L \|b_t\|^2}{2\sqrt{2\epsilon}\, m R}$ and $0 < \beta'_t < 1$. Since the eigenvalues of $H$ are equal to or greater than $R$, and $H$ is symmetric and positive definite, $\frac{1}{R} I - H^{-1}$ is positive semi-definite. Hence, $x^T H^{-1} x \le \frac{1}{R}\|x\|^2$ is satisfied. Using this fact, the following inequality holds:

$$\begin{aligned}
F_k(\psi_t + \beta_t g^*_t) - F_k(\psi_t) &\le \beta_t \nabla F_k(\psi_t)^T g^*_t + \frac{L}{2}\|\beta_t g^*_t\|^2 \quad (\because \nabla F_k \text{ is } L\text{-Lipschitz continuous}) \\
&= \beta_t g_k^T g^*_t + \frac{L}{2}\beta_t^2 \|g^*_t\|^2 = \beta_t g_k^T g^*_t + \frac{L}{2}\beta_t^2\, b_t^T H^{-1} b_t \quad (\because g^*_t = B^{-T} b_t) \\
&\le -\beta_t c_k + \frac{L}{2R}\beta_t^2 \|b_t\|^2. \quad (\because g_k^T g^*_t + c_k \le 0) \quad (21)
\end{aligned}$$

Now, we will show that $\psi$ enters $\mathrm{IB}_k$ within a finite time for $\forall \psi \in \mathrm{OB}_k$ and that the $k$th constraint is satisfied for $\forall \psi \in \mathrm{IB}_k$. Thus, we divide into two cases: 1) $\psi_t \in \mathrm{OB}_k$ and 2) $\psi_t \in \mathrm{IB}_k$.
For the first case, $c_k = \sqrt{2\epsilon}\|\bar{g}_k\|$, so the following inequality holds:

$$F_k(\psi_t + \beta_t g^*_t) - F_k(\psi_t) \le \beta_t \left( -\sqrt{2\epsilon}\|\bar{g}_k\| + \frac{L}{2R}\beta_t \|b_t\|^2 \right) \le \beta_t \sqrt{2\epsilon}\left( -\|\bar{g}_k\| + m \beta'_t \right) \le \beta_t \sqrt{2\epsilon}\, m (\beta'_t - 1) < 0.$$

The value of $F_k$ decreases strictly with each update step according to (21). Hence, $\psi_t$ can reach $\mathrm{IB}_k$ by repeatedly updating the policy. We now check whether the constraint is satisfied in the second case. For the second case, the following inequality holds by applying $c_k = F_k(\psi_t) - d_k + \zeta$:

$$F_k(\psi_t + \beta_t g^*_t) - F_k(\psi_t) \le \beta_t d_k - \beta_t F_k(\psi_t) - \beta_t \zeta + \frac{L}{2R}\beta_t^2 \|b_t\|^2 \;\Rightarrow\; F_k(\psi_t + \beta_t g^*_t) - d_k \le (1 - \beta_t)(F_k(\psi_t) - d_k) + \beta_t\left( -\zeta + \sqrt{2\epsilon}\, m \beta'_t \right).$$

Since $\psi_t \in \mathrm{IB}_k$, $F_k(\psi_t) - d_k \le \sqrt{2\epsilon}\|\bar{g}_k\| - \zeta \le \sqrt{2\epsilon}\, M - \zeta \le 0$. Since $m \le M$ and $\beta'_t < 1$, $-\zeta + \sqrt{2\epsilon}\, m \beta'_t < -\zeta + \sqrt{2\epsilon}\, M \le 0$. Hence, $F_k(\psi_t + \beta_t g^*_t) \le d_k$, which means that the $k$th constraint is satisfied if $\psi_t \in \mathrm{IB}_k$. As $\psi_t$ reaches $\mathrm{IB}_k$ for $\forall k$ within a finite time according to (21), the policy can satisfy all constraints within a finite time.

Lemma 7 shows the convergence to the feasibility condition in the case of varying step sizes. We introduce a lemma showing that $\|b_t\|$ is bounded by $\sqrt{\epsilon}$ up to a constant, and finally show the proof of Theorem 2, which can be considered a special case of varying step sizes.

Lemma 8. There exists $T \in \mathbb{R}_{>0}$ such that $\|b_t\| \le T\sqrt{\epsilon}$.

Proof. By solving the dual problem of (20), $g^*_t$ can be expressed as:

$$g^*_t = -\sum_{k=1}^{K} \lambda_k H^{-1} g_k \quad \text{s.t.} \;\; \lambda_k = \max\left( \frac{c_k - \sum_{j \neq k} \lambda_j g_j^T H^{-1} g_k}{g_k^T H^{-1} g_k}, \; 0 \right) \;\; \text{for } \forall k.$$

The following inequality holds for $\forall k$:

$$\lambda_k \le \max\left( \frac{c_k}{\|\bar{g}_k\|^2}, 0 \right) \le \max\left( \frac{\sqrt{2\epsilon}\|\bar{g}_k\|}{\|\bar{g}_k\|^2}, 0 \right) \le \frac{\sqrt{2\epsilon}}{\|\bar{g}_k\|}.$$

Using the triangle inequality,

$$\|b_t\| = \|B^T g^*_t\| = \left\| \sum_k \lambda_k B^T H^{-1} g_k \right\| \le \sum_k \lambda_k \|B^T H^{-1} g_k\| \le \sqrt{2\epsilon} \sum_k \frac{\|B^T H^{-1} g_k\|}{\|\bar{g}_k\|} = K\sqrt{2\epsilon}.$$

Hence, for every constant $T > \sqrt{2} K$, the statement holds.

Theorem 2.
Assume that the cost surrogates are differentiable and convex, the gradients of the surrogates are $L$-Lipschitz continuous, the eigenvalues of the Hessian are equal to or greater than a positive value $R \in \mathbb{R}_{>0}$, and $\{\psi \,|\, F_k(\pi_\psi; \alpha) + \zeta < d_k, \; \forall k\} \neq \emptyset$. Then, there exists $E \in \mathbb{R}_{>0}$ such that if $0 < \epsilon \le E$ and a policy is updated by the proposed gradient integration method, all constraints are satisfied within finite time steps.

Proof. The proposed step size is $\beta_t = \min(1, \sqrt{2\epsilon}/\|b_t\|)$, and the sufficient conditions that guarantee the convergence according to Lemma 7 are the following: $\sqrt{2\epsilon}\, M \le \zeta$, $0 < \beta_t \le 1$, and $\beta_t < \frac{2\sqrt{2\epsilon}\, m R}{L\|b_t\|^2}$. The second condition is self-evident. To satisfy the third condition, the proposed step size $\beta_t$ should satisfy the following:

$$\frac{\sqrt{2\epsilon}}{\|b_t\|} < \frac{2\sqrt{2\epsilon}\, m R}{L\|b_t\|^2} \;\Leftrightarrow\; \|b_t\| < \frac{2mR}{L}.$$

If $\epsilon < 2\left(\frac{mR}{LK}\right)^2$, the following inequality holds:

$$\sqrt{2\epsilon} < \frac{2mR}{LK} \;\Rightarrow\; \|b_t\| \le K\sqrt{2\epsilon} < \frac{2mR}{L}. \quad (\because \text{Lemma 8})$$

Hence, if $\epsilon \le E = \frac{1}{2}\min\left( \frac{\zeta^2}{2M^2}, \; 2\left(\frac{mR}{LK}\right)^2 \right)$, the sufficient conditions are satisfied.
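The QP (20) and the dual form quoted in Lemma 8 can be sketched numerically as follows. The coordinate-ascent loop on the dual variables and the iteration count are our own choices for illustration, not the paper's solver.

```python
import numpy as np

def gradient_integration(H, gs, cs, iters=200):
    """Solve  min_g 0.5 g^T H g  s.t.  g_k^T g + c_k <= 0 for all k,
    via coordinate ascent on the dual variables lambda_k, using the
    closed-form update lambda_k = max((c_k - sum_{j!=k} lambda_j g_j^T
    H^{-1} g_k) / (g_k^T H^{-1} g_k), 0) from Lemma 8 (sketch)."""
    K = len(gs)
    Hinv_g = [np.linalg.solve(H, g) for g in gs]
    # Gram matrix S[k, j] = g_k^T H^{-1} g_j (symmetric since H is).
    S = np.array([[gs[k] @ Hinv_g[j] for j in range(K)] for k in range(K)])
    lam = np.zeros(K)
    for _ in range(iters):
        for k in range(K):
            others = sum(lam[j] * S[j, k] for j in range(K) if j != k)
            lam[k] = max((cs[k] - others) / S[k, k], 0.0)
    # Primal recovery: g* = -sum_k lambda_k H^{-1} g_k.
    return -sum(l * hg for l, hg in zip(lam, Hinv_g))
```

For a single active constraint this reproduces the expected behavior: with $H = I$, $g_1 = e_1$, and $c_1 = 1$, the solution is $g^* = -e_1$, which makes the constraint exactly tight.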

A.5 POLICY UPDATE RULE

To solve the constrained optimization problem (12), we find a policy update direction by linearly approximating the objective and safety constraints and quadratically approximating the trust region constraint, as done by Achiam et al. (2017). After finding the direction, we update the policy using a line search method. Given the current policy parameter $\psi_t \in \Psi$, the approximated problem can be expressed as follows:

$$x^* = \underset{x \in \Psi}{\mathrm{argmax}} \;\; g^T x \quad \text{s.t.} \;\; \frac{1}{2} x^T H x \le \epsilon, \;\; b_k^T x + c_k \le 0 \;\; \forall k, \quad (22)$$

where $g = \nabla_\psi J_{\mu,\pi}(\pi_\psi)|_{\psi=\psi_t}$, $H = \nabla^2_\psi D_{\mathrm{KL}}(\pi_{\psi_t}||\pi_\psi)|_{\psi=\psi_t}$, $b_k = \nabla_\psi F^k_{\mu,\pi}(\pi_\psi; \alpha)|_{\psi=\psi_t}$, and $c_k = F_k(\pi_{\psi_t}; \alpha) - d_k$. Since (22) is convex, we can use an existing convex optimization solver. However, the search space, which is the policy parameter space $\Psi$, is excessively large, so we reduce the space by converting (22) to a dual problem as follows:

$$g(\lambda, \nu) = \min_x L(x, \lambda, \nu) = \min_x \left\{ g^T x + \nu\left( \frac{1}{2} x^T H x - \epsilon \right) + \lambda^T (Bx + c) \right\} = -\frac{1}{2\nu}\Big( \underbrace{g^T H^{-1} g}_{=:q} + 2 \underbrace{g^T H^{-1} B^T}_{=:r^T} \lambda + \lambda^T \underbrace{B H^{-1} B^T}_{=:S} \lambda \Big) + \lambda^T c - \nu \epsilon = -\frac{1}{2\nu}\left( q + 2 r^T \lambda + \lambda^T S \lambda \right) + \lambda^T c - \nu \epsilon,$$

where $B = (b_1, \ldots, b_K)^T$, $c = (c_1, \ldots, c_K)^T$, and $\lambda \in \mathbb{R}^K_{\ge 0}$ and $\nu \in \mathbb{R}_{\ge 0}$ are Lagrange multipliers. Then, the optimal $\lambda$ and $\nu$ can be obtained by a convex optimization solver. After obtaining the optimal values, $(\lambda^*, \nu^*) = \mathrm{argmax}_{(\lambda, \nu)} g(\lambda, \nu)$, the policy update direction is calculated as $x^* = -\frac{1}{\nu^*} H^{-1}(B^T \lambda^* + g)$. Then, the policy is updated by $\psi_{t+1} = \psi_t + \beta x^*$, where $\beta$ is a step size calculated by a line search method.
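In the special case without safety constraints ($K = 0$), the dual of (22) can be solved in closed form, which yields the familiar natural-gradient step scaled to the trust region boundary. The sketch below, including the simple backtracking line search, is our own illustration under that simplification.

```python
import numpy as np

def trust_region_direction(g, H, eps):
    """Maximize g^T x subject to 0.5 x^T H x <= eps (no cost constraints).
    The KKT conditions give x* = H^{-1} g / nu with
    nu = sqrt(g^T H^{-1} g / (2 eps)), so the step lands exactly on the
    trust region boundary."""
    Hinv_g = np.linalg.solve(H, g)
    nu = np.sqrt(g @ Hinv_g / (2.0 * eps))
    return Hinv_g / nu

def line_search(obj, psi, x, max_backtracks=10, shrink=0.8):
    """Backtracking line search: shrink the step until the objective improves."""
    f0 = obj(psi)
    beta = 1.0
    for _ in range(max_backtracks):
        if obj(psi + beta * x) > f0:
            return psi + beta * x
        beta *= shrink
    return psi  # no improving step found
```

One can verify that the returned direction saturates the trust region, i.e., $\frac{1}{2} x^{*T} H x^* = \epsilon$, which is why the line search only ever shrinks the step.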

B ABLATION STUDY ON SURROGATE FUNCTIONS

We have extended the off-policy TRPO (Meng et al., 2022) to an entropy-regularized version and reformulated it as a SAC-style version. In this section, we evaluate the original, entropy-regularized, and SAC-style versions on the continuous control tasks of the MuJoCo simulator (Todorov et al., 2012). We use neural networks with two hidden layers of (512, 512) nodes and ReLU as the activation function. The output of the value network is linear, but the input differs: the original and entropy-regularized versions use states, and the SAC-style version uses state-action pairs. The input of the policy network is the state, the output is the mean μ and std σ, and actions are squashed into tanh(μ + εσ), ε ∼ N(0, 1), as in SAC (Haarnoja et al., 2018). The entropy coefficient β in the entropy-regularized and SAC-style versions is adaptively adjusted to keep the entropy above a threshold (set as -d given A ⊆ R^d). The hyperparameters for all versions are summarized in Table 1. The training curves are presented in Figure 6. All methods are trained with five different random seeds. Since there is no importance ratio and the Q-functions directly provide the policy update direction, the SAC-style version outperforms the others.

C EXPERIMENTAL SETTINGS

Safety Gym. We use the goal and button tasks with the point and car robots in the Safety Gym environment (Ray et al., 2019), as shown in Figures 7a and 7b. The environmental setting for the goal task is the same as in Kim & Oh (2022b). Eight hazard regions and one goal are randomly spawned at the beginning of each episode, and a robot gets a reward and cost as follows:

$$R(s, a, s') = -\Delta d_{\mathrm{goal}} + \mathbb{1}_{d_{\mathrm{goal}} \le 0.3}, \quad C(s, a, s') = \mathrm{Sigmoid}(10 \cdot (0.2 - d_{\mathrm{hazard}})), \quad (24)$$

where $d_{\mathrm{goal}}$ is the distance to the goal, and $d_{\mathrm{hazard}}$ is the minimum distance to the hazard regions. If $d_{\mathrm{goal}}$ is less than or equal to 0.3, a goal is respawned, and the number of constraint violations (CVs) is counted when $d_{\mathrm{hazard}}$ is less than 0.2. The state consists of the relative goal position, goal distance, linear and angular velocities, acceleration, and LiDAR values. The action space is two-dimensional, consisting of xy-directional forces for the point robot and wheel velocities for the car robot. The environmental settings for the button task are the same as in Liu et al. (2022). There are five hazard regions, four dynamic obstacles, and four buttons, and all components are fixed throughout training. The initial position of the robot and the activated button are randomly chosen at the beginning of each episode. The reward function is the same as in (24), but the cost is different since there is no dense signal for contacts. We define the cost function for the button task as an indicator function that outputs one if the robot makes contact with an obstacle or an inactive button or enters a hazard region. We add LiDAR values of buttons and obstacles to the state of the goal task, and the actions are the same as in the goal task. The episode length is 1000 steps without early termination.

Locomotion Tasks. We use three different legged robots, Mini-Cheetah, Laikago, and Cassie, for the locomotion tasks, as shown in Figures 7c, 7d, and 7e.
The tasks aim to control robots to follow a velocity command on flat terrain. A velocity command is given by $(v^{\mathrm{cmd}}_x, v^{\mathrm{cmd}}_y, \omega^{\mathrm{cmd}}_z)$, where $v^{\mathrm{cmd}}_x \sim U(-1.0, 1.0)$ for Cassie and $U(-1.0, 2.0)$ otherwise, $v^{\mathrm{cmd}}_y = 0$, and $\omega^{\mathrm{cmd}}_z \sim U(-0.5, 0.5)$. To lower the task complexity, we set the y-directional linear velocity command to zero, although it can be scaled to any nonzero value. As in other locomotion studies (Lee et al., 2020; Miki et al., 2022), central phases are introduced to produce periodic motion, defined as $\phi_i(t) = \phi_{i,0} + f \cdot t$ for $\forall i \in \{1, \ldots, n_{\mathrm{legs}}\}$, where $f$ is a frequency coefficient, set to 10, and $\phi_{i,0}$ is an initial phase. The actuators of the robots are controlled by PD control toward target positions given by the actions. The state consists of the velocity command, orientation of the robot frame, linear and angular velocities of the robot, positions and speeds of the actuators, central phases, history of the positions and speeds of the actuators (past two steps), and history of actions (past two steps). A foot contact timing $\xi$ can be defined as follows:

$$\xi_i(s) = -1 + 2 \cdot \mathbb{1}_{\sin(\phi_i) \le 0} \quad \forall i \in \{1, \ldots, n_{\mathrm{legs}}\}, \quad (25)$$

where a value of $-1$ means that the $i$th foot is on the ground; otherwise, the foot is in the air. For the quadrupedal robots, Mini-Cheetah and Laikago, we use the initial phases $\phi_0 = \{0, \pi, \pi, 0\}$, which generate trot gaits. For the bipedal robot, Cassie, the initial phases are defined as $\phi_0 = \{0, \pi\}$, which generate walk gaits.
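The predefined contact timing in (25) can be sketched as follows; the helper names and list-based interface are hypothetical.

```python
import math

def contact_timing(phases):
    """Predefined foot contact timing, xi_i = -1 + 2 * 1[sin(phi_i) <= 0]:
    a value of -1 marks a foot that should be on the ground, +1 a foot in
    the air."""
    return [-1 + 2 * (1 if math.sin(p) <= 0 else 0) for p in phases]

def advance_phases(init_phases, f, t):
    """Central phases phi_i(t) = phi_{i,0} + f * t, with frequency f = 10."""
    return [p0 + f * t for p0 in init_phases]

# Trot gait for a quadruped: diagonal leg pairs share the same phase,
# so their timing values always agree while the two pairs alternate.
trot_init = [0.0, math.pi, math.pi, 0.0]
timing = contact_timing(advance_phases(trot_init, 10.0, 0.0))
```

Because diagonal legs share a phase in the trot initialization, the first and fourth entries of `timing` always match, as do the second and third, while the two pairs take opposite values over most of the cycle.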
Then, the reward and cost functions are defined as follows:

$$R(s, a, s') = -0.1 \cdot \left( \|v^{\mathrm{base}}_{x,y} - v^{\mathrm{cmd}}_{x,y}\|_2^2 + \|\omega^{\mathrm{base}}_z - \omega^{\mathrm{cmd}}_z\|_2^2 + 10^{-3} \cdot R_{\mathrm{power}} \right),$$
$$C_1(s, a, s') = \mathbb{1}_{\mathrm{angle} \ge a}, \quad C_2(s, a, s') = \mathbb{1}_{\mathrm{height} \le b}, \quad C_3(s, a, s') = \sum_{i=1}^{n_{\mathrm{legs}}} \frac{1 - \xi_i \cdot \bar{\xi}_i}{2 \cdot n_{\mathrm{legs}}}, \quad (26)$$

where the power consumption $R_{\mathrm{power}} = \sum_i |\tau_i v_i|$, the sum of the torque times the actuator speed, is added to the reward as a regularization term, $v^{\mathrm{base}}_{x,y}$ is the xy-directional linear velocity of the base frame of the robot, $\omega^{\mathrm{base}}_z$ is the z-directional angular velocity of the base frame, and $\bar{\xi} \in \{-1, 1\}^{n_{\mathrm{legs}}}$ is the current foot contact vector. For balancing, the first cost indicates whether the angle between the z-axis of the robot base and that of the world is greater than a threshold ($a = 15°$ for all robots). For standing, the second cost indicates whether the height of the CoM is less than a threshold ($b = 0.3, 0.35, 0.7$ for Mini-Cheetah, Laikago, and Cassie, respectively), and the last cost checks that the current foot contact vector $\bar{\xi}$ matches the predefined timing $\xi$. The episode length is 500 steps. There is no early termination, but if a robot falls to the ground, the state is frozen until the end of the episode.

Hyperparameter Settings. The structure of the neural networks consists of two hidden layers with (512, 512) nodes and ReLU activation for all baselines and the proposed method. The input of the value networks is state-action pairs, and the output is the positions of the atoms. The input of the policy networks is the state, the output is the mean μ and std σ, and actions are squashed into tanh(μ + εσ), ε ∼ N(0, 1). We use a fixed entropy coefficient β. The trust region size ε is set to 0.001 for all trust region-based methods. The overall hyperparameters for the proposed method are summarized in Table 2. Since the range of the cost is [0, 1], the maximum discounted cost sum is 1/(1 - γ). Thus, the limit value is set to the target cost rate times 1/(1 - γ).
For the locomotion tasks, the third cost in (26) is designed for foot stamping, which is not essential to safety. Hence, we set the limit value near the maximum (if a robot does not stamp, the cost rate becomes 0.5). The reward weights, which are optimized using the existing Bayesian optimization tool, are also presented in Table 2. In addition, the baseline methods use multiple critic networks for the cost function, such as target (Yang et al., 2021) or square value networks (Kim & Oh, 2022a). To match the number of network parameters, we use two critics as an ensemble, as in Kuznetsov et al. (2020).

Tips for Hyperparameter Tuning.
• Discount factor γ, critic learning rate: Since these are commonly used hyperparameters, we do not discuss them.
• Trace-decay λ, trust region size ε: Ablation studies on these hyperparameters are presented in Appendix D.4. From the results, we recommend setting the trace-decay to 0.95 ∼ 0.99, as in other TD(λ)-based methods (Precup et al., 2000). The results also show that the performance is not sensitive to the trust region size. However, if the trust region size is too large, the approximation error increases, so it is better to set it below 0.003.
• Entropy coefficient β: This value is fixed in our experiments, but it can be adjusted automatically as done in SAC (Haarnoja et al., 2018).
• Number of atoms M, M′: Although experiments on the number of atoms were not performed, performance is expected to increase with the number of atoms, as in other distributional RL methods (Dabney et al., 2018a).

D ADDITIONAL EXPERIMENTAL RESULTS

D.1 TRAINING CURVES FOR THE SAFETY GYM TASKS

In this section, we present the training curves of the Safety Gym tasks separately according to the conservativeness of the constraints for better readability. Figure 8 shows the training results of the mean constrained and mean-std constrained algorithms with α = 1.0. Figures 9 and 10 show the training results of the mean-std constrained algorithms with α = 0.5 and 0.25, respectively.
In Figure 8, it can be observed that the score of SDAC converges fastest and that the cost rates also converge quickly to the limit values. Since all the other methods show their highest total CVs in the car button task, this task appears the most challenging for constraint satisfaction; accordingly, SDAC also has a higher cost rate on the car button task than on the other tasks. In addition, since decreasing α makes the constraints more conservative, the cost rates and the number of total CVs of SDAC are reduced in Figures 9 and 10. For α = 0.25 and 0.5, SDAC shows the highest scores and the lowest number of CVs in all tasks.



We use Sweeps from Weights & Biases (Biewald, 2020).



Algorithm 1: Safe Distributional Actor-Critic
Data: Policy network $\pi_\psi$, reward and cost critic networks $Z^\pi_{R,\theta}$, $Z^\pi_{C_k,\theta}$, and replay buffer $D$.
Initialize network parameters $\psi$, $\theta$, and the replay buffer $D$.
for epochs = 1, E do
    for t = 1, T do
        Sample $a_t \sim \pi_\psi(\cdot|s_t)$ and get $s_{t+1}$, $r_t = R(s_t, a_t, s_{t+1})$, and $c_{k,t} = C_k(s_t, a_t, s_{t+1})$.
        Store $(s_t, a_t, \pi_\psi(a_t|s_t), r_t, c_{k,t}, s_{t+1})$ in $D$.
    end
    Calculate the TD(λ) target distribution (Section 3.1) with $D$ and update the critics to minimize (4).
    Calculate the surrogate (10) and the cost surrogates (11) with $D$.

Figure 2: Gradient Integration.

Figure 3: Graphs of final scores and the total number of CVs for the Safety Gym tasks. The number after the algorithm name in the legend indicates α used for the constraint. The center and boundary of ellipses are drawn using the mean and covariance of the five runs for each method. Dashed lines connect results of the same method with different α.

Figure 4: True reward sum graphs according to the x-directional command. The true reward is defined in Appendix C. The solid line and shaded area represent average and one fifth of std value, respectively. The graphs are obtained by running ten episodes per seed for each command.

(a) Training curves of the point goal task according to the trace-decay λ. (b) Training curves of the naive and proposed methods for the Cassie task.

Figure 5: Ablation results. The cost rates show the cost sums divided by the episode length. The shaded area represents the standard deviation. The black lines indicate the limit values, and the dotted lines in (b) represent the limit values + 0.025.

Figure 6: MuJoCo training curves.

Figure 7: (a) and (b) are Safety Gym tasks. (c), (d), and (e) are locomotion tasks.

Figure 8: Training curves of mean constrained algorithms for the Safety Gym tasks. The solid line and shaded area represent the average and std values, respectively. The solid black lines in the second row indicate limit values. All methods are trained with five random seeds.

Figure 9: Training curves of mean-std constrained algorithms with α = 0.5 for the Safety Gym.

Figure 10: Training curves of mean-std constrained algorithms with α = 0.25 for the Safety Gym.

Table 1: Hyperparameters for all versions.

Table 2: Hyperparameter settings for the Safety Gym and locomotion tasks.

D.3 ABLATION STUDY ON COMPONENTS OF SDAC

We experiment with variations of SDAC to examine the effectiveness of each component. SDAC has two main components: SAC-style surrogate functions and distributional critics. We call SDAC with only the distributional critics SDAC-Dist, and SDAC with only the SAC-style surrogates SDAC-Q. If both components are absent, SDAC is identical to OffTRC (Kim & Oh, 2022a). The variants are trained on the point goal task of the Safety Gym, and the training results are shown in Figure 12. SDAC-Q lowers the cost rate quickly but shows the lowest score. SDAC-Dist shows scores similar to SDAC, but its cost rate converges above the limit value of 0.025. In conclusion, SDAC efficiently satisfies the safety constraints through the SAC-style surrogates and improves score performance through the distributional critics.

D.4 ABLATION STUDY ON HYPERPARAMETERS

To check the effects of the hyperparameters, we conduct ablation studies on the trust region size and the entropy coefficient. The results are shown in Figure 13. From the entropy coefficient results, it can be seen that excessive exploration causes constraint violations. Thus, the entropy coefficient should be adjusted cautiously, or it may be better to set the coefficient to zero. Since Theorem 1 shows that the estimation error of the surrogates is proportional to the trust region size, the number of CVs in Figure 13b increases with the trust region size due to this estimation error.

E COMPUTATIONAL COST ANALYSIS

In this section, we analyze the computational cost of the gradient integration method, which has three subparts. First, it is required to calculate the policy gradient of each cost surrogate, $g_k$, and $H^{-1}g_k$ for $\forall k \in \{1, 2, \ldots, K\}$, where $H$ is the Hessian matrix of the KL divergence. $H^{-1}g_k$ can be computed using the conjugate gradient method, which requires only a constant number of back-propagations through the cost surrogate, so the computational cost can be expressed as $K \cdot O(\mathrm{BackProp})$. Second, the quadratic problem in Section 3.3 is transformed into a dual problem, where the transformation requires inner products between $g_k$ and $H^{-1}g_m$ for $\forall k, m \in \{1, 2, \ldots, K\}$. This computational cost can be expressed as $K^2 \cdot O(\mathrm{InnerProd})$. Finally, the transformed quadratic problem is solved in the dual space $\mathbb{R}^K$ using a quadratic programming solver. Since $K$ is usually much smaller than the number of policy parameters, this cost is almost negligible compared to the others. The total cost of the gradient integration is then $K \cdot O(\mathrm{BackProp}) + K^2 \cdot O(\mathrm{InnerProd}) + C$. Since the costs of the back-propagation and the inner products are proportional to the number of policy parameters $|\psi|$, the computational cost can be simplified as $O(K^2 \cdot |\psi|)$.
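The conjugate gradient computation of $H^{-1}g_k$ mentioned above can be sketched as follows; only Hessian-vector products are needed, so $H$ is never materialized (the function and argument names are our own).

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products
    hvp(v), as is standard for computing H^{-1} g_k in trust region
    methods without forming H explicitly."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x (x = 0 initially)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

In practice, `hvp` is implemented with a double backward pass through the KL divergence, so each conjugate gradient iteration costs a constant number of back-propagations, matching the $K \cdot O(\mathrm{BackProp})$ term above.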

