INTERPRETING DISTRIBUTIONAL REINFORCEMENT LEARNING: A REGULARIZATION PERSPECTIVE

Abstract

Distributional reinforcement learning (RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than its expected value alone. Despite the remarkable performance of distributional RL, its theoretical advantages over expectation-based RL remain elusive. Our work attributes the potential superiority of distributional RL to a regularization effect stemming from the value distribution information beyond its expectation alone. We decompose the value distribution into its expectation and the remaining distribution part using a variant of the gross error model in robust statistics. Distributional RL thereby gains an additional benefit over expectation-based RL through the impact of a risk-sensitive entropy regularization within the Neural Fitted Z-Iteration framework. Meanwhile, we investigate the role of the resulting regularization in actor-critic algorithms by bridging the risk-sensitive entropy regularization of distributional RL and the vanilla entropy in maximum entropy RL. This reveals that distributional RL induces an augmented reward function, which promotes a risk-sensitive exploration against the intrinsic uncertainty of the environment. Finally, extensive experiments verify the importance of the regularization effect in distributional RL, as well as the mutual impacts of different entropy regularizations. Our study paves the way towards a better understanding of distributional RL, especially when viewed through a regularization lens.

Under review as a conference paper at ICLR 2023

We simplify distributional RL into a Neural Fitted Z-Iteration framework, within which we establish an equivalence of objective functions between distributional RL and a risk-sensitive entropy-regularized Neural Fitted Q-Iteration from the perspective of statistics.
This result is based on two analytical components: an action-value density function decomposition that leverages a variant of the gross error model in robust statistics, and the Kullback-Leibler (KL) divergence used to measure the distance between the current and target value distributions in each Bellman update. We then establish a connection between the impact of the risk-sensitive entropy regularization of distributional RL and the vanilla entropy in maximum entropy RL, yielding a Distribution-Entropy-Regularized Actor-Critic algorithm. Empirical results demonstrate the crucial role of the risk-sensitive entropy regularization effect of distributional RL in its potential superiority over expectation-based RL on both Atari games and MuJoCo environments. We also reveal mutual impacts of the risk-sensitive entropy in distributional RL and the vanilla entropy in maximum entropy RL, suggesting further research directions.

2. PRELIMINARIES

In classical RL, an agent interacts with an environment via a Markov decision process (MDP), a 5-tuple (S, A, R, P, γ), where S and A are the state and action spaces, respectively, P is the environment transition dynamics, R is the reward function, and γ ∈ (0, 1) is the discount factor.

Action-value Function vs Action-value Distribution. Given a policy π, the discounted sum of future rewards is a random variable Z^π(s, a) = Σ_{t=0}^∞ γ^t R(s_t, a_t), where s_0 = s, a_0 = a, s_{t+1} ∼ P(·|s_t, a_t), and a_t ∼ π(·|s_t). In the control setting, expectation-based RL focuses on the action-value function Q^π(s, a), the expectation of Z^π(s, a), i.e., Q^π(s, a) = E[Z^π(s, a)]. Distributional RL, on the other hand, focuses on the action-value distribution, the full distribution of Z^π(s, a). The density function of the action-value distribution, if it exists, is called the action-value density function.

Bellman Operators vs Distributional Bellman Operators.
For policy evaluation in expectation-based RL, the value function is updated via the Bellman operator T^π Q(s, a) = E[R(s, a)] + γ E_{s'∼P, a'∼π}[Q(s', a')]. We also define the Bellman optimality operator T^opt Q(s, a) = E[R(s, a)] + γ max_{a'} E_{s'∼P}[Q(s', a')]. In distributional RL, the action-value distribution of Z^π(s, a) is updated via the distributional Bellman operator T^π, i.e., T^π Z(s, a) =_D R(s, a) + γ Z(s', a'), where s' ∼ P(·|s, a) and a' ∼ π(·|s').

1. INTRODUCTION

The intrinsic characteristics of classical reinforcement learning (RL) algorithms, such as temporal-difference (TD) learning (Sutton & Barto, 2018) and Q-learning (Watkins & Dayan, 1992), are based on the expectation of discounted cumulative rewards that an agent observes while interacting with the environment. In stark contrast to classical expectation-based RL, a new branch of algorithms called distributional RL estimates the full distribution of total returns and has demonstrated state-of-the-art performance in a wide range of environments (Bellemare et al., 2017a; Dabney et al., 2018b;a; Yang et al., 2019; Zhou et al., 2020; Nguyen et al., 2020; Sun et al., 2022). Meanwhile, distributional RL also offers benefits in risk-sensitive control (Dabney et al., 2018a), policy exploration (Mavrin et al., 2019; Rowland et al., 2019) and robustness (Sun et al., 2021). Despite the numerous algorithmic variants of distributional RL with remarkable empirical success, we still have a poor understanding of where the effectiveness of distributional RL stems from, and theoretical analyses of its advantages over expectation-based RL remain scarce. Distributional RL has also been mapped to a Wasserstein gradient flow problem (Martin et al., 2020), treating the distributional Bellman residual as a potential energy functional. Offline distributional RL (Ma et al., 2021) has been proposed to investigate the efficacy of distributional RL in both risk-neutral and risk-averse domains. Lyle et al. (2019) proved that in many realizations of tabular and linear approximation settings, distributional RL behaves the same as expectation-based RL under the coupled updates method, but diverges from it under non-linear approximation. Although the explanations offered by these works are not yet sufficient, the trend of recent work towards closing the gap between theory and practice in distributional RL is encouraging.
In this paper, we illuminate the behavior difference of distributional RL over expectation-based RL through the lens of regularization to explain its empirical outperformance in most practical environments. Specifically, we simplify distributional RL into a Neural Fitted Z-Iteration framework.

The distributional Bellman update reads T^π Z(s, a) =_D R(s, a) + γ Z(s', a'), where s' ∼ P(·|s, a) and a' ∼ π(·|s'); the equality =_D means that the random variables on both sides are equal in distribution. This random-variable definition of the distributional Bellman operator is appealing and easily understood due to its concise form, although the value-distribution definition is more mathematically rigorous (Rowland et al., 2018; Bellemare et al., 2022).

Categorical Distributional RL. Categorical distributional RL (Bellemare et al., 2017a) can be viewed as the first successful distributional RL algorithm family. It approximates the value distribution η by a discrete categorical distribution η̂ = Σ_{i=1}^N p_i δ_{z_i}, where z_1 ≤ z_2 ≤ ... ≤ z_N is a set of fixed supports and {p_i}_{i=1}^N are learnable probabilities. The use of a heuristic projection operator Π_C (see Appendix A for more details) together with the KL divergence allows the theoretical convergence of categorical distributional RL under the Cramér distance (Rowland et al., 2018).
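The random-variable form of the distributional Bellman operator above lends itself to a direct sample-based sketch. The toy example below (all distributions and parameters are hypothetical choices for illustration) applies T^π Z =_D R + γ Z' to samples and checks that taking expectations recovers the classical Bellman backup on Q-values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples approximating the return distribution Z(s', a') at the next
# state-action pair, and samples of the immediate reward R(s, a).
z_next = rng.normal(loc=5.0, scale=1.0, size=10_000)
rewards = rng.normal(loc=1.0, scale=0.5, size=10_000)
gamma = 0.9

# Random-variable form of the distributional Bellman operator:
# (T^pi Z)(s, a)  =_D  R(s, a) + gamma * Z(s', a')
z_target = rewards + gamma * z_next

# Taking expectations on both sides recovers the classical Bellman backup.
q_backup = rewards.mean() + gamma * z_next.mean()
```

The target samples carry the full shape of the return distribution (spread, tails), while their mean coincides with the expectation-based update.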

3.1. DISTRIBUTIONAL RL: NEURAL FITTED Z-ITERATION (NEURAL FZI)

Expectation-based RL: Neural Fitted Q-Iteration (Neural FQI). Neural FQI (Fan et al., 2020; Riedmiller, 2005) offers a statistical explanation of DQN (Mnih et al., 2015), capturing its key features, including experience replay and the target network Q_{θ*}. In Neural FQI, we update the parameterized Q_θ(s, a) in each iteration k by solving a regression problem:

Q_θ^{k+1} = argmin_{Q_θ} (1/n) Σ_{i=1}^n [y_i − Q_θ^k(s_i, a_i)]², (1)

where the target y_i = r(s_i, a_i) + γ max_{a∈A} Q_{θ*}^k(s'_i, a) is fixed within every T_target steps, after which the target network Q_{θ*} is updated by letting θ* = θ. The experience buffer induces independent samples {(s_i, a_i, r_i, s'_i)}_{i∈[n]}. In an ideal case, when we neglect the non-convexity and TD approximation errors, we have Q_θ^{k+1} = T^opt Q_{θ*}^k, which is exactly the updating rule under the Bellman optimality operator (Fan et al., 2020). From the viewpoint of statistics, the optimization problem in Eq. 1 in each iteration is a standard supervised, neural-network-parameterized regression on Q_θ.

Distributional RL: Neural Fitted Z-Iteration (Neural FZI). We interpret distributional RL as a Neural Fitted Z-Iteration because this iteration is by far the closest to practical algorithms and the most interpretable. Analogous to Neural FQI, we can simplify value-based distributional RL algorithms parameterized by Z_θ into a Neural Fitted Z-Iteration (Neural FZI):

Z_θ^{k+1} = argmin_{Z_θ} (1/n) Σ_{i=1}^n d_p(Y_i, Z_θ^k(s_i, a_i)), (2)

where the target Y_i = R(s_i, a_i) + γ Z_{θ*}^k(s'_i, π_Z(s'_i)), with the policy π_Z following the greedy rule π_Z(s'_i) = argmax_{a'} E[Z_{θ*}^k(s'_i, a')], is fixed within every T_target steps to update the target network Z_{θ*}. Here d_p is a divergence between two distributions. Notably, the choices of the representation of Z_θ and the metric d_p are pivotal for the empirical success of distributional RL algorithms (Sun et al., 2022).
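One Neural FQI iteration in Eq. 1 is just a supervised regression onto the fixed target y_i. The sketch below uses a hypothetical linear function class and random arrays standing in for a replay batch, building the targets and solving the regression in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.99
n, n_actions, d = 256, 4, 8

# A replay batch {(s_i, a_i, r_i, s'_i)}, represented via feature vectors.
phi = rng.normal(size=(n, d))                  # features of (s_i, a_i)
rewards = rng.normal(size=n)
phi_next = rng.normal(size=(n, n_actions, d))  # features of (s'_i, a) per action

theta_target = rng.normal(size=d)              # frozen target-network weights

# Fixed regression target: y_i = r_i + gamma * max_a Q_target(s'_i, a)
q_next = phi_next @ theta_target               # shape (n, n_actions)
y = rewards + gamma * q_next.max(axis=1)

# One Neural FQI iteration = least-squares regression of y on the features.
theta_new, *_ = np.linalg.lstsq(phi, y, rcond=None)
```

By construction the fitted weights attain at most the squared loss of the old weights on this batch, mirroring the argmin in Eq. 1.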

3.2. DISTRIBUTIONAL RL: ENTROPY-REGULARIZED NEURAL FQI

Action-Value Density Function Decomposition. To separate the impact of the additional distribution information from the expectation of Z^π, we leverage a variant of the gross error model from robust statistics (Huber, 2004), which was similarly used to analyze Label Smoothing (Müller et al., 2019) and Knowledge Distillation (Hinton et al., 2015). In particular, we utilize a histogram estimator p̂_{s,a}(x) with N bins to approximate an arbitrary continuous action-value density function p_{s,a}(x) given a state s and action a, as the histogram is arguably the simplest density estimator in the non-parametric statistics literature. Given a fixed set of supports z_0 ≤ z_1 ≤ ... ≤ z_N with equal bin size Δ, let Δ_i = [z_{i−1}, z_i) for i = 1, ..., N−1 and Δ_N = [z_{N−1}, z_N]. The histogram density function is p̂_{s,a}(x) = Σ_{i=1}^N p_i 1(x ∈ Δ_i)/Δ. Denote by Δ_E the interval that E[Z^π(s, a)] falls into, i.e., E[Z^π(s, a)] ∈ Δ_E. We conduct an action-value density function decomposition of p̂_{s,a}(x) as follows:

p̂_{s,a}(x) = (1 − ϵ) 1(x ∈ Δ_E)/Δ + ϵ μ̂(x), (3)

where the decomposition induces a new histogram μ̂(x) = Σ_{i=1}^N p_i^μ 1(x ∈ Δ_i)/Δ that approximates a continuous density function μ(x). μ̂(x) (or μ(x)) characterizes the impact of the action-value distribution, beyond its expectation E[Z^π(s, a)], on the performance of distributional RL algorithms. ϵ controls the proportion between the single-bin histogram 1(x ∈ Δ_E)/Δ and μ̂(x); we will later show that this single-bin histogram is linked to Neural FQI. Before diving deeper, we show that μ̂(x) is a valid density function under a condition on ϵ in Proposition 1.

Proposition 1. Denote p̂_{s,a}(x) = p_E/Δ for x ∈ Δ_E. Following the density function decomposition in Eq. 3, μ̂(x) = Σ_{i=1}^N p_i^μ 1(x ∈ Δ_i)/Δ is a valid probability density function ⇐⇒ ϵ ≥ 1 − p_E.

Proof is provided in Appendix B.
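Solving Eq. 3 bin by bin gives p_i^μ = p_i/ϵ for bins other than Δ_E and p_E^μ = (p_E − (1 − ϵ))/ϵ for the expectation bin, which makes Proposition 1 easy to check numerically. A minimal sketch (the histogram values below are hypothetical):

```python
import numpy as np

def decompose(p, e_idx, eps):
    """Split a histogram p as (1 - eps) * one-hot(expectation bin) + eps * p_mu.

    Solving Eq. 3 bin by bin: p_mu[i] = p[i] / eps for i != e_idx, and
    p_mu[e_idx] = (p[e_idx] - (1 - eps)) / eps.
    """
    p_mu = p / eps
    p_mu[e_idx] = (p[e_idx] - (1.0 - eps)) / eps
    return p_mu

p = np.array([0.1, 0.6, 0.2, 0.1])   # histogram probabilities, N = 4 bins
e_idx = 1                            # bin containing E[Z(s, a)], so p_E = 0.6

# Proposition 1: p_mu is a valid density  <=>  eps >= 1 - p_E = 0.4.
valid = decompose(p, e_idx, eps=0.5)     # eps >= 0.4 -> all entries nonnegative
invalid = decompose(p, e_idx, eps=0.3)   # eps <  0.4 -> negative mass at e_idx
```

The residual histogram always sums to one; the only thing that can fail is nonnegativity of the expectation bin, exactly as Proposition 1 states.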
We next show that the histogram density estimator p̂_{s,a}(x) enjoys a uniform convergence rate when approximating an arbitrary continuous action-value density function p_{s,a}(x) under a mild condition in Theorem 1. Proof is provided in Appendix C.

Theorem 1. (Approximation Analysis of p̂_{s,a}) Suppose p_{s,a}(x) is Lipschitz continuous and the support of the random variable is partitioned into N bins of size Δ. Then

sup_x |p̂_{s,a}(x) − p_{s,a}(x)| = O(Δ) + O_P(√(log N/(nΔ²))). (4)

Distributional RL: Entropy-regularized Neural FQI. We apply the decomposition to the target action-value histogram density function and choose the KL divergence as d_p in Neural FZI. Let H(P, Q) be the cross entropy between two probability measures P and Q, i.e., H(P, Q) = −∫_{x∈X} P(x) log Q(x) dx. The target histogram density function p̂_{s,a} is decomposed as p̂_{s,a}(x) = (1 − ϵ) 1(x ∈ Δ_E)/Δ + ϵ μ̂(x). We can then derive the following entropy-regularized form for distributional RL in Proposition 3.

Proposition 3. Denote q_θ^{s,a}(x) as the histogram density function of Z_θ^k(s, a) in Neural FZI. Based on the decomposition in Eq. 3 and the KL divergence as d_p, Neural FZI in Eq. 2 simplifies to

Z_θ^{k+1} = argmin_{q_θ} (1/n) Σ_{i=1}^n [−log q_θ^{s_i,a_i}(Δ_E^i) + α H(μ̂_{s'_i, π_Z(s'_i)}, q_θ^{s_i,a_i})], (5)

where α = ϵ/(1 − ϵ) > 0 and Δ_E^i represents the interval that E[Z^π(s'_i, π_Z(s'_i))] falls into, i.e., E[Z^π(s'_i, π_Z(s'_i))] ∈ Δ_E^i. μ̂_{s'_i, π_Z(s'_i)} is the resulting histogram density function for the next state-action pair (s'_i, π_Z(s'_i)). Proof is given in Appendix F.

In Proposition 4, with proof in Appendix G, we further show that minimizing the first term in Eq. 5 is "almost" equivalent to Neural FQI. For uniformity of notation, we use s, a in the following analysis instead of s_i, a_i.

Proposition 4. (Connection between Neural FZI and FQI via Decomposition) In Eq.
5 of Neural FZI, suppose the function class {Z_θ : θ ∈ Θ} is sufficiently large such that it contains the target {Y_i}_{i=1}^n, where Y_i = R(s_i, a_i) + γ Z_{θ*}^k(s'_i, π_Z(s'_i)). Then minimizing the first term in Eq. 5 as Δ → 0 implies

P(Z_θ^{k+1}(s, a) = T^opt Q_{θ*}^k(s, a)) = 1. (6)

Interpretation of Proposition 4. Given the fact that Q_θ^{k+1} = T^opt Q_{θ*}^k ideally holds in Neural FQI (Fan et al., 2020), we have Z_θ^{k+1} = T^opt Q_{θ*}^k with probability one under the assumption in Proposition 4. This indicates that Z_θ^{k+1} may take values other than its expectation part E[Z_θ^{k+1}] = T^opt Q_{θ*}^k, but such events occur with probability 0. This result establishes a theoretical link between the first term of Neural FZI in Eq. 5 and Neural FQI.

Interpretation of Proposition 3. Based on the equivalence between the first term of Neural FZI and FQI, we interpret the distributional RL form in Eq. 5 as an entropy-regularized Neural FQI. Thus, the second, regularization term H(μ̂_{s'_i, π_Z(s'_i)}, q_θ^{s_i,a_i}) aims at explaining the behavior difference between distributional RL and expectation-based RL. It pushes q_θ^{s,a} for the current state-action pair to approximate μ̂_{s'_i, π_Z(s'_i)} for the next state-action pair, which "deducts" the expectation effect from the whole action-value distribution by leveraging the density function decomposition in Eq. 3. In summary, we interpret the impacts of the two terms in Eq. 5 on the distributional RL optimization as the expectation effect and the distributional regularization effect, respectively.

Risk-Sensitive Entropy Regularization. We attribute the behavior difference of distributional RL, especially its ability to significantly reduce the intrinsic uncertainty of the environment (Mavrin et al., 2019), to the regularization term in Eq. 5.
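The per-sample objective of Eq. 5 is cheap to compute once the current histogram q_θ and the target μ̂ are available. A sketch with hypothetical histograms, using α = ϵ/(1 − ϵ) as in Proposition 3:

```python
import numpy as np

def neural_fzi_loss(q, mu, e_idx, eps):
    """Per-sample loss of Eq. 5: -log q(Delta_E) + alpha * H(mu, q),
    with alpha = eps / (1 - eps) and H the cross entropy over bins."""
    alpha = eps / (1.0 - eps)
    expectation_term = -np.log(q[e_idx])      # pushes mass onto the bin of E[Z]
    regularization = -np.sum(mu * np.log(q))  # cross entropy H(mu, q)
    return expectation_term + alpha * regularization

q = np.array([0.2, 0.5, 0.2, 0.1])    # current histogram q_theta(s, a)
mu = np.array([0.1, 0.3, 0.4, 0.2])   # target mu-hat from the decomposition
loss = neural_fzi_loss(q, mu, e_idx=1, eps=0.5)
```

With eps = 0.5 the two terms are weighted equally (α = 1); shrinking eps shrinks the distributional regularization and recovers the Neural FQI-like first term.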
According to the literature on risk in RL (Dabney et al., 2018a), where "risk" refers to the uncertainty over possible outcomes and "risk-sensitive policies" are those which depend upon more than the mean of the outcomes, we call the novel cross-entropy regularization in the second term of Eq. 5 the risk-sensitive entropy regularization. This regularization, derived within distributional RL, expands the class of policies using information provided by the distribution over returns (i.e., to the class of risk-sensitive policies). It should also be noted that our risk-sensitive entropy regularization is in fact "risk-neutral" in the sense of the convexity or concavity of utility functions, as our policy still applies a linear utility function U, defined via π(·|s) = argmax_a E_{z∼Z(s,a)}[U(z)]. Correspondingly, we can vary distortion risk measures to explicitly lead the policy to be risk-averse or risk-seeking (Dabney et al., 2018a).

Remark on the KL divergence. As stated for categorical distributional RL in Section 2, when the categorical distribution is applied after the projection operator Π_C, the distributional Bellman operator T^π has a contraction guarantee under the Cramér distance (Rowland et al., 2018), albeit with the use of a non-expansive KL divergence (Morimura et al., 2012). Similarly, our histogram density function with the projection Π_C, equipped with the KL divergence, also enjoys a contraction property due to the equivalence between optimizing the histogram function and the categorical distribution analyzed in Proposition 2. We also summarize favorable properties of the KL divergence in distributional RL in Appendix E.

Remark on the Attainability of μ̂_{s', π_Z(s')}. In practical distributional RL algorithms, we typically use the bootstrap, e.g., TD learning, to attain the target probability density estimate μ̂_{s', π_Z(s')} based on Eq. 3, as long as E[Z(s, a)] exists and ϵ ≥ 1 − p_E as in Proposition 1.
The leverage of μ̂_{s', π_Z(s')} and the regularization effect revealed in Eq. 5 de facto establish a bridge between distributional RL and maximum entropy RL (Williams & Peng, 1991), which we analyze in depth in Section 3.3.
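The remark above that the greedy rule applies a linear (risk-neutral) utility, whereas distortion risk measures can make the policy risk-averse or risk-seeking, can be illustrated on return samples. The sketch below uses CVaR as a standard risk-averse example; the two return distributions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Return samples for two actions: identical means, different dispersion.
z_a = rng.normal(loc=1.0, scale=0.1, size=100_000)   # low-risk action
z_b = rng.normal(loc=1.0, scale=2.0, size=100_000)   # high-risk action

def risk_neutral(z):
    """Linear utility: score an action by E[Z], as in the greedy rule pi_Z."""
    return z.mean()

def cvar(z, alpha=0.1):
    """CVaR_alpha, a risk-averse distortion: mean of the worst alpha-tail."""
    cutoff = np.quantile(z, alpha)
    return z[z <= cutoff].mean()

# Risk-neutral scoring cannot tell the two actions apart; CVaR prefers z_a.
```

Only the distributional representation makes the CVaR scoring possible, since the tail of Z(s, a) is unavailable to a pure Q-value estimate.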

3.3. CONNECTION WITH MAXIMUM ENTROPY RL

Vanilla Entropy Regularization in Maximum Entropy RL. Maximum entropy RL (Williams & Peng, 1991), including Soft Q-Learning (Haarnoja et al., 2017), explicitly optimizes for policies that aim to reach states where they will have high entropy in the future:

J(π) = Σ_{t=0}^T E_{(s_t,a_t)∼ρ_π}[r(s_t, a_t) + β H(π(·|s_t))], (7)

where H(π_θ(·|s_t)) = −Σ_a π_θ(a|s_t) log π_θ(a|s_t) and ρ_π is the state-action distribution generated by following π. The temperature parameter β determines the relative importance of the entropy term against the cumulative rewards, and thus controls the action diversity of the optimal policy learned via Eq. 7. This maximum entropy regularization has various conceptual and practical advantages. Firstly, the learned policy is encouraged to visit states with high entropy in the future, thus promoting exploration over diverse states (Han & Sung, 2021). Secondly, it considerably improves the learning speed (Mei et al., 2020) and is therefore widely used in state-of-the-art algorithms, e.g., Soft Actor-Critic (SAC) (Haarnoja et al., 2018). The similar empirical benefits of distributional RL and maximum entropy RL encourage us to probe their underlying connection.

Risk-Sensitive Entropy Regularization in Distributional RL. To make a direct comparison with maximum entropy RL, we need to specifically analyze the impact of the regularization term in Eq. 5, and thus we incorporate the risk-sensitive entropy regularization of distributional RL into the policy gradient framework, akin to maximum entropy RL. Concretely, we conduct our analysis by showing the convergence of Distribution-Entropy-Regularized Policy Iteration (DERPI), the counterpart of Soft Policy Iteration (Haarnoja et al., 2018), i.e., the underpinning of the SAC algorithm. In principle, Distribution-Entropy-Regularized Policy Iteration replaces the vanilla entropy regularization in Soft Policy Iteration with our risk-sensitive entropy regularization in Eq. 5 from distributional RL.
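For concreteness, the vanilla entropy bonus of Eq. 7, which DERPI will replace with the risk-sensitive term, can be sketched directly; the two policies below are hypothetical:

```python
import numpy as np

def entropy(pi):
    """Policy entropy H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s)."""
    return -np.sum(pi * np.log(pi))

def augmented_step_reward(r, pi, beta):
    """Per-step term of the maximum entropy objective in Eq. 7:
    r(s_t, a_t) + beta * H(pi(.|s_t))."""
    return r + beta * entropy(pi)

uniform = np.array([0.25, 0.25, 0.25, 0.25])  # maximally diverse policy
peaked = np.array([0.97, 0.01, 0.01, 0.01])   # nearly deterministic policy
```

With the same environment reward, the uniform policy earns a larger entropy bonus, which is exactly the exploration pressure the temperature β modulates.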
In the policy evaluation step of distribution-entropy-regularized policy iteration, a new soft Q-value, i.e., the expectation of Z^π(s, a), can be computed iteratively by applying a modified Bellman operator T_d^π, which we call the Distribution-Entropy-Regularized Bellman Operator, defined as

T_d^π Q(s_t, a_t) ≜ r(s_t, a_t) + γ E_{s_{t+1}∼P(·|s_t,a_t)}[V(s_{t+1}|s_t, a_t)], (8)

where the new soft value function V(s_{t+1}|s_t, a_t) conditioned on s_t, a_t is defined by

V(s_{t+1}|s_t, a_t) = E_{a_{t+1}∼π}[Q(s_{t+1}, a_{t+1})] + f(H(μ_{s_t,a_t}, q_θ^{s_t,a_t})), (9)

and f is a continuous increasing function of the cross entropy H. Note that in this specific tabular setting for s_t and a_t, we use q_θ^{s_t,a_t}(x) to approximate the true density function of Z(s_t, a_t), and μ_{s_t,a_t} to represent the true target value distribution beyond its expectation, which can normally be obtained via the bootstrap estimate μ̂_{s_{t+1}, π_Z(s_{t+1})}, similarly to Eq. 5. The transformation f of the cross entropy H between μ_{s_t,a_t} and q_θ^{s_t,a_t}(x) serves as our risk-sensitive entropy regularization. As opposed to the vanilla entropy regularization in maximum entropy RL that encourages the policy to explore, our risk-sensitive entropy regularization in distributional RL plays the role of a reward correction, or augmented reward, and therefore augments the action-value function Q(s_t, a_t) in value-based RL and the objective function in policy-gradient RL by additionally incorporating the value distribution knowledge. Having discussed Neural FZI, which builds on value-based RL, in Section 3.2, we now shift our attention to the properties of our risk-sensitive entropy regularization in the policy gradient framework. In Lemma 1, we first show that our Distribution-Entropy-Regularized Bellman operator T_d^π still inherits the convergence property in the policy evaluation phase.

Lemma 1.
(Distribution-Entropy-Regularized Policy Evaluation) Consider the distribution-entropy-regularized Bellman operator T_d^π in Eq. 8 and the behavior of the expectation of Z^π(s, a), i.e., Q(s, a). Assume H(μ_{s_t,a_t}, q_θ^{s_t,a_t}) ≤ M for all (s_t, a_t) ∈ S × A, where M is a constant. Define Q^{k+1} = T_d^π Q^k; then Q^{k+1} converges to a corrected Q-value of π as k → ∞, with the new objective function defined as

J'(π) = Σ_{t=0}^T E_{(s_t,a_t)∼ρ_π}[r(s_t, a_t) + γ f(H(μ_{s_t,a_t}, q_θ^{s_t,a_t}))]. (10)

Lemma 1 reveals that the new objective function for distributional RL can be interpreted as an augmented reward function. Secondly, in the policy improvement step for distributional RL, we keep the vanilla policy improvement updating rule:

π_new = argmax_{π'∈Π} E_{a_t∼π'}[Q^{π_old}(s_t, a_t)]. (11)

We can then immediately derive a new policy iteration algorithm, called Distribution-Entropy-Regularized Policy Iteration (DERPI), that alternates between the distribution-entropy-regularized policy evaluation in Eq. 8 and the policy improvement in Eq. 11. It provably converges to the policy with the optimal risk-sensitive entropy among all policies in Π, as shown in Theorem 2.

Theorem 2. (Distribution-Entropy-Regularized Policy Iteration) Assume H(μ_{s_t,a_t}, q_θ^{s_t,a_t}) ≤ M for all (s_t, a_t) ∈ S × A, where M is a constant. Repeatedly applying the distribution-entropy-regularized policy evaluation in Eq. 8 and the policy improvement in Eq. 11, the policy converges to an optimal policy π* such that Q^{π*}(s_t, a_t) ≥ Q^π(s_t, a_t) for all π ∈ Π.

Please refer to Appendix H for the proofs of Lemma 1 and Theorem 2. According to Theorem 2, if we incorporate the risk-sensitive entropy regularization into the policy gradient framework in Eq. 10, we are able to design a variant of "soft policy iteration" that guarantees convergence to an optimal policy.
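Lemma 1 can be checked numerically in a tabular sketch: treating the bonus f(H(μ, q_θ)) as fixed bounded numbers (random placeholders below standing in for the cross-entropy term), repeated application of T_d^π converges to a fixed point. The MDP here is a hypothetical toy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, :] = P(.|s, a)
R = rng.normal(size=(n_s, n_a))                   # expected rewards r(s, a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)        # policy pi(a|s)
bonus = rng.uniform(0.0, 1.0, size=(n_s, n_a))    # placeholder for f(H(mu, q))

def derpi_backup(Q):
    """T_d^pi Q(s,a) = r(s,a) + gamma * (E_{s'}[E_{a'~pi} Q(s',a')] + bonus(s,a)),
    matching Eqs. 8-9 with the cross-entropy bonus held fixed."""
    V = (pi * Q).sum(axis=1)            # E_{a'~pi}[Q(s', a')] for every s'
    return R + gamma * (P @ V + bonus)

Q = np.zeros((n_s, n_a))
for _ in range(1000):                   # gamma-contraction -> fixed point
    Q = derpi_backup(Q)
```

Because the bonus is bounded (the assumption H ≤ M in Lemma 1), the operator remains a γ-contraction and the iteration converges.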
As such, we provide a comprehensive comparison between the vanilla entropy in maximum entropy RL and the risk-sensitive entropy in distributional RL.

Figure 1: q_θ^{s,a} is encouraged to disperse under the risk-sensitive entropy regularization of distributional RL.

Vanilla Entropy Regularization vs Risk-Sensitive Entropy Regularization. (1) Objective function. Comparing the objective functions J(π) in Eq. 7 for maximum entropy RL and J'(π) in Eq. 10 for distributional RL, distributional RL maximizes the risk-sensitive entropy regularization w.r.t. π. This indicates that the learned policy in distributional RL is encouraged to visit state-action pairs whose action-value distributions have a higher degree of dispersion, e.g., variance, regardless of their expectations, thus promoting a risk-sensitive exploration that reduces the intrinsic uncertainty of the environment. An intuitive illustration is provided in Figure 1. (2) State-action-dependent regularization. The vanilla entropy H(π(·|s_t)) in maximum entropy RL is state-wise, while our risk-sensitive regularization H(μ_{s_t,a_t}, q_θ^{s_t,a_t}) is state-action-wise, implying that it is a more fine-grained regularization characterizing the action-value distribution of Z(s_t, a_t).

3.4. ALGORITHM: DISTRIBUTION-ENTROPY-REGULARIZED ACTOR-CRITIC (DERAC)

In practice, large continuous domains require us to derive a practical approximation to DERPI. We thus extend DERPI from the tabular setting to the function approximation case, yielding the Distribution-Entropy-Regularized Actor-Critic (DERAC) algorithm, which uses function approximators for both the value distribution q_θ(s_t, a_t) and the policy π_φ(a_t|s_t). The key characteristic of the DERAC algorithm is that we use a function approximator to represent the whole value distribution q_θ rather than only the value function, while conducting the optimization mainly based on the value function Q_θ(s_t, a_t) = E[q_θ(s_t, a_t)].

Optimizing the parameterized value distribution q_θ. The new value function is originally trained to minimize the squared residual error of Eq. 8. For a cleaner interpretation, we impose a zero-expectation assumption on the residual, i.e., T^π Q_θ(s, a) = Q_θ(s, a) + b with E[b] = 0. The resulting simplified objective function Ĵ_q(θ) can be interpreted as an interpolation between the expectation effect and the distributional regularization effect:

Ĵ_q(θ) = E_{s,a}[(T_d^π Q_{θ*}(s, a) − Q_θ(s, a))²] ∝ (1 − λ) E_{s,a}[(T^π E[q_{θ*}(s, a)] − E[q_θ(s, a)])²] + λ E_{s,a}[H(μ_{s,a}, q_θ^{s,a})], (12)

where the result is simplified by using the particular increasing function f(H) = (λH)^{1/2}/γ, and λ ∈ [0, 1] is a hyperparameter that controls the risk-sensitive regularization effect. Interestingly, when we use the whole target density function p̂_{s,a} to approximate the true μ_{s,a}, the objective function in Eq. 12 can be viewed as an exact interpolation between the loss of expectation-based RL (the first term) and the categorical distributional RL loss (the second term), e.g., C51. Note that for the target T^π E[q_{θ*}(s, a)], we use the target value distribution network q_{θ*} to stabilize training, consistent with the Neural FZI framework analyzed in Section 3.1.

Optimizing the policy π_φ. We optimize π_φ in Eq.
11 based on Q(s, a); the new objective function can thus be expressed as Ĵ_π(φ) = E_{s,a∼π_φ}[E[q_θ(s, a)]]. The complete DERAC algorithm is described in Algorithm 1 of Appendix J.
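The interpolation in Eq. 12 can be sketched per transition: the loss is linear in λ, moving between a pure squared expectation error (λ = 0) and a pure cross-entropy term (λ = 1). All numbers below are hypothetical:

```python
import numpy as np

def derac_critic_loss(q, z, r, q_next_mean, mu, gamma, lam):
    """Single-transition sketch of Eq. 12:
    (1 - lam) * (T^pi E[q_target] - E[q])^2 + lam * H(mu, q)."""
    q_mean = np.dot(q, z)                      # E[q_theta(s, a)] on supports z
    td_target = r + gamma * q_next_mean        # T^pi applied to the target mean
    expectation_term = (td_target - q_mean) ** 2
    regularization = -np.sum(mu * np.log(q))   # cross entropy H(mu, q)
    return (1.0 - lam) * expectation_term + lam * regularization

z = np.linspace(-2.0, 2.0, 5)               # fixed supports
q = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # current histogram q_theta(s, a)
mu = np.array([0.05, 0.25, 0.3, 0.3, 0.1])  # target mu from the decomposition
loss_mse = derac_critic_loss(q, z, 1.0, 0.5, mu, 0.99, lam=0.0)
loss_ce = derac_critic_loss(q, z, 1.0, 0.5, mu, 0.99, lam=1.0)
```

Linearity in λ is what lets Eq. 12 be read as an exact interpolation between the expectation-based loss and a categorical distributional loss.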

4. EXPERIMENTS

In the experiments, we first verify the regularization effect of distributional RL analyzed in Section 3.2 by decomposing the action-value histogram density function via Eq. 5 on both Atari games and MuJoCo environments. Next, we demonstrate the convergence and favorable performance of the DERAC algorithm on continuous control environments. Finally, an empirical extension to Implicit Quantile Networks (IQN) is provided to reveal the mutual impacts of different entropy regularizations.

Environments. To demonstrate the value distribution decomposition, we mainly present results on three Atari games (Breakout, Seaquest, Hero) over 3 seeds and three continuous control MuJoCo environments in OpenAI Gym (Ant, Swimmer, BipedalWalkerHardcore) over 5 seeds. For the extension to IQN, we perform experiments on eight MuJoCo environments.

Baselines. To evaluate the risk-sensitive entropy regularization effect of distributional RL, we conduct an ablation study on C51 (Bellemare et al., 2017a) for Atari games and on distributional SAC (DSAC) (Ma et al., 2020) for MuJoCo environments. The implementation of the DERAC algorithm is based on distributional SAC (Haarnoja et al., 2018; Ma et al., 2020). More implementation details are provided in Appendix I.

4.1. DISTRIBUTION REGULARIZATION EFFECT OF DISTRIBUTIONAL RL

We demonstrate the rationale of the action-value density function decomposition in Eq. 3 and the distribution regularization effect analyzed in Eq. 5 based on the C51 algorithm equipped with the KL divergence. Firstly, the value distribution decomposition relies on the equivalence between the KL divergence and the cross entropy, owing to the use of the target network. Hence, we demonstrate that the C51 algorithm can still achieve similar results under the cross entropy loss across four Atari games in Figure 5 of Appendix K. In both the value-based C51 loss and the critic loss of DSAC with C51, we replace the whole target categorical distribution p̂_{s,a}(x) in C51 with the derived μ̂_{s,a}(x) under different ε in the cross entropy loss, allowing us to investigate the risk-sensitive regularization effect of distributional RL. Concretely, we define ε as the proportion of the probability of the bin containing the expectation whose mass is transported to the other bins. We use ε in place of ϵ for convenience, as ε always guarantees a valid density function μ̂, as analyzed in Proposition 1; a large ε corresponds to a large ϵ in Eq. 3.

Figure 2: (First Row) Learning curves of C51 with the value distribution decomposition H(μ, q_θ) under different ε on three Atari games over 3 seeds, with DQN and vanilla C51 as baselines.
(Second Row) Learning curves of DSAC with the C51-based value distribution decomposition H(μ, q_θ) under different ε on three MuJoCo environments over 5 seeds, with SAC and DSAC as baselines.

As shown in Figure 2, when ε gradually decreases from 0.8 to 0.1, the learning curves of C51 with H(μ, q_θ) tend to degrade from vanilla C51 towards DQN across both Atari and MuJoCo, although the sensitivity to ε may depend on the environment, e.g., BipedalWalkerHardcore. This empirical observation corroborates the theoretical results derived in Section 3.2, suggesting that the risk-sensitive entropy regularization is pivotal to the success of distributional RL algorithms.

4.2. CONVERGENCE OF DERAC ALGORITHM

We further demonstrate the convergence of the DERAC algorithm. Figure 3 showcases that DERAC converges and achieves desirable performance on MuJoCo environments compared with AC (SAC without vanilla entropy), shown as the blue line. More importantly, Distribution-Entropy-Regularization (DER), shown as the red line, can be remarkably beneficial for learning on the complex BipedalWalkerHardcore, where risk-sensitive exploration significantly improves performance. It is worth noting that our goal in introducing the DERAC algorithm is not to pursue empirical superiority, but to corroborate the theoretical convergence of the DERAC algorithm and DERPI in Theorem 2. In addition, since we choose ε = 0.9 in the DERAC algorithm, there exists a loss of distribution information, resulting in degraded learning performance, e.g., on Swimmer. In practice, we can directly deploy distributional SAC to seek better performance. We also provide a sensitivity analysis of DERAC with respect to λ in Figure 6 of Appendix K.

4.3. EXTENSION TO QUANTILE-BASED DISTRIBUTIONAL RL

Finally, since our theoretical analysis above is closely connected to categorical distributional RL algorithms, e.g., C51, we extend the empirical study to quantile-based distributional RL to draw a more comprehensive conclusion. For the implementation, we leverage the quantile generation strategy of IQN (Dabney et al., 2018a) within distributional SAC (Ma et al., 2020). Hyper-parameters are listed in Appendix I. As suggested in Figure 4, although the vanilla entropy and risk-sensitive entropy effects may vary across environments, we draw the following conclusions: (1) The vanilla entropy effect can enhance performance, as AC+VE (blue lines) outperforms AC (red lines) across most environments except Humanoid and Swimmer. The risk-sensitive entropy effect (RE) from distributional RL is also able to benefit learning, as AC+RE (black lines) tends to surpass AC (red lines), especially on the complex BipedalWalkerHardcore environment (hard for exploration). (2) The use of both risk-sensitive entropy and vanilla entropy may interfere with each other, e.g., on BipedalWalkerHardcore and Swimmer, where AC+RE+VE (orange lines) is significantly inferior to AC+RE (black lines). This may result from the different exploration preferences of the two regularization effects: SAC encourages the policy to visit states with high entropy to pursue state diversity, while distributional RL promotes a risk-sensitive exploration towards state-action pairs whose action-value distributions have a larger degree of dispersion. We hypothesize that mixing two different exploration directions may lead to sub-optimal solutions in certain environments, so that the two regularizations eventually interfere with each other.

5. DISCUSSIONS AND CONCLUSION

Our regularization interpretation is based on the histogram function equipped with the KL divergence, and is thus strongly connected with categorical distributional RL. Although the histogram is linked with the quantile function as well, a direct analysis based on the quantile function is also a promising future direction. In this paper, we illuminate the behavior difference of distributional RL over expectation-based RL from the perspective of regularization. A risk-sensitive entropy regularization is derived for distributional RL within Neural FZI to explain the potential advantage of distributional RL. We also establish a connection between distributional RL and maximum entropy RL. Our research contributes to a deeper understanding of the potential superiority of distributional RL algorithms.

A CONVERGENCE GUARANTEE OF CATEGORICAL DISTRIBUTIONAL RL

Categorical distributional RL (Bellemare et al., 2017a) uses the heuristic projection operator Π_C, defined by

Π_C(δ_y) = δ_{z_1} if y ≤ z_1; ((z_{i+1} - y)/(z_{i+1} - z_i)) δ_{z_i} + ((y - z_i)/(z_{i+1} - z_i)) δ_{z_{i+1}} if z_i < y ≤ z_{i+1}; δ_{z_K} if y > z_K,

and extended affinely to finite mixtures of Dirac measures, so that for a mixture of Diracs Σ_{i=1}^N p_i δ_{y_i}, we have Π_C(Σ_{i=1}^N p_i δ_{y_i}) = Σ_{i=1}^N p_i Π_C(δ_{y_i}). The Cramér distance was recently studied as an alternative to the Wasserstein distance in the context of generative models (Bellemare et al., 2017b). Recall its definition.

Definition 1. (Definition 3 in (Rowland et al., 2018)) The Cramér distance ℓ_2 between two distributions ν_1, ν_2 ∈ P(R), with cumulative distribution functions F_{ν_1}, F_{ν_2} respectively, is defined by ℓ_2(ν_1, ν_2) = (∫_R (F_{ν_1}(x) - F_{ν_2}(x))^2 dx)^{1/2}. Further, the supremum-Cramér metric ℓ̄_2 between two distribution functions η, µ ∈ P(R)^{X×A} is defined by ℓ̄_2(η, µ) = sup_{(x,a)∈X×A} ℓ_2(η_{(x,a)}, µ_{(x,a)}).

Thus, the contraction of categorical distributional RL can be guaranteed under the Cramér distance:

Proposition 5.
(Proposition 2 in (Rowland et al., 2018)) The operator Π_C T^π is a √γ-contraction in ℓ̄_2. An insight behind this conclusion is that the Cramér distance endows a particular subset of distributions with a notion of orthogonal projection, and the orthogonal projection onto this subset is exactly the heuristic projection Π_C (Proposition 1 in (Rowland et al., 2018)).
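To make the projection operator Π_C concrete, here is a minimal numpy sketch of projecting a mixture of Diracs onto a fixed support of atoms. The function name and variable names are ours, and the code follows the piecewise definition above rather than any particular library implementation.

```python
import numpy as np

def project_categorical(samples, probs, atoms):
    """Heuristic projection Pi_C: spread each Dirac delta_y onto the two
    neighbouring atoms, proportionally to its distance from each, with
    mass clipped to the boundary atoms outside the support."""
    out = np.zeros(len(atoms))
    for y, p in zip(samples, probs):
        if y <= atoms[0]:
            out[0] += p            # delta_{z_1} branch
        elif y > atoms[-1]:
            out[-1] += p           # delta_{z_K} branch
        else:
            i = np.searchsorted(atoms, y) - 1   # atoms[i] < y <= atoms[i+1]
            w = (y - atoms[i]) / (atoms[i + 1] - atoms[i])
            out[i] += p * (1.0 - w)
            out[i + 1] += p * w
    return out

atoms = np.array([0.0, 1.0, 2.0])
# A Dirac at 0.25 with mass 1 splits 0.75 / 0.25 between atoms 0 and 1.
proj = project_categorical([0.25], [1.0], atoms)
```

Note that the projection is affine in the input mixture, matching the affine extension stated above: projecting each Dirac separately and summing gives the same result as projecting the mixture.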

B PROOF OF PROPOSITION 1

Proposition 1. Denote p_{s,a}(x ∈ ∆_E) = p_E/∆. Following the density function decomposition in Eq. 3, µ(x) = Σ_{i=1}^N p^µ_i 1(x ∈ ∆_i)/∆ is a valid probability density function ⇐⇒ ε ≥ 1 - p_E.

Proof. Recall that a valid probability density function of this form requires each bin probability to lie in [0, 1] and all bin probabilities to sum to 1.

Necessity. (1) When x ∈ ∆_E, Eq. 3 simplifies to p_E/∆ = (1 - ε)/∆ + ε p^µ_E/∆, where p^µ_E = µ(x ∈ ∆_E). Thus p^µ_E = (p_E - (1 - ε))/ε ≥ 0 iff ε ≥ 1 - p_E. Moreover, p^µ_E = (p_E - (1 - ε))/ε ≤ (1 - (1 - ε))/ε = 1 is guaranteed by the validity of p_{s,a}. (2) When x ∉ ∆_E, we have p_i/∆ = ε p^µ_i/∆, i.e., p^µ_i = p_i/ε ≤ (1 - p_E)/ε ≤ 1 when ε ≥ 1 - p_E; also p^µ_i = p_i/ε ≥ 0.

Sufficiency. (1) When x ∈ ∆_E, requiring p^µ_E = (p_E - (1 - ε))/ε ≥ 0 yields ε ≥ 1 - p_E, while p^µ_E ≤ 1 holds automatically. (2) When x ∉ ∆_E, p^µ_i = p_i/ε ≥ 0 holds automatically; requiring p^µ_i = p_i/ε ≤ 1 yields p_i ≤ ε. Taking the intersection of (1) and (2): since p_i ≤ 1 - p_E for i ≠ E, the condition ε ≥ 1 - p_E already implies ε ≥ p_i, so the condition in (2) is satisfied and the intersection of (1) and (2) is ε ≥ 1 - p_E. Finally, the bin probabilities of µ sum to 1 automatically, since Σ_i p^µ_i = (Σ_i p_i - (1 - ε))/ε = (1 - (1 - ε))/ε = 1.

In summary, since ε ≥ 1 - p_E is both necessary and sufficient, µ(x) is a valid probability density function ⇐⇒ ε ≥ 1 - p_E.
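Proposition 1 can be sanity-checked numerically: given bin probabilities p and the decomposition p = (1 - ε) 1(x ∈ ∆_E) + ε µ, we can solve for µ and verify it is a valid pmf exactly when ε ≥ 1 - p_E. This is an illustrative sketch with our own function and variable names.

```python
import numpy as np

def recover_mu(p, e_idx, eps):
    """Solve p = (1 - eps) * 1(x in Delta_E) + eps * mu for the bin
    probabilities of mu; return None if mu is not a valid pmf."""
    mu = p / eps                                   # bins outside Delta_E
    mu[e_idx] = (p[e_idx] - (1.0 - eps)) / eps     # the Delta_E bin
    if np.all(mu >= -1e-12) and np.all(mu <= 1 + 1e-12):
        return mu
    return None

p = np.array([0.1, 0.6, 0.3])   # bin probabilities; E[Z] falls in bin 1
e_idx = 1                       # so p_E = 0.6 and 1 - p_E = 0.4
invalid = recover_mu(p, e_idx, 0.3)   # eps < 1 - p_E: mu has negative mass
mu = recover_mu(p, e_idx, 0.5)        # eps >= 1 - p_E: valid pmf
```

For ε = 0.3 < 0.4 the ∆_E bin of µ would be (0.6 - 0.7)/0.3 < 0, matching the necessity direction; for ε = 0.5 the recovered µ = [0.2, 0.2, 0.6] sums to 1, matching sufficiency.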

C PROOF OF THEOREM 1

Theorem 1. Suppose p_{s,a}(x) is Lipschitz continuous and the support of X is partitioned into N bins of size ∆. Then sup_x |p̂_{s,a}(x) - p_{s,a}(x)| = O(∆) + O_P(√(log N / (n∆²))).

Proof. Our proof mainly follows (Wasserman, 2006). In particular, the difference p̂_{s,a}(x) - p_{s,a}(x) can be decomposed as

p̂_{s,a}(x) - p_{s,a}(x) = [E(p̂_{s,a}(x)) - p_{s,a}(x)] (bias) + [p̂_{s,a}(x) - E(p̂_{s,a}(x))] (stochastic variation).

(1) The bias term. Without loss of generality, consider x ∈ ∆_k. We have

E(p̂_{s,a}(x)) = P(X ∈ ∆_k)/∆ = (∫_{z_0+(k-1)∆}^{z_0+k∆} p(y) dy)/∆ = (F(z_0 + k∆) - F(z_0 + (k-1)∆)) / (z_0 + k∆ - (z_0 + (k-1)∆)) = p_{s,a}(x̃),

where the last equality follows from the mean value theorem for some x̃ ∈ ∆_k. By the L-Lipschitz continuity of p_{s,a}, we have |E(p̂_{s,a}(x)) - p_{s,a}(x)| = |p_{s,a}(x̃) - p_{s,a}(x)| ≤ L|x̃ - x| ≤ L∆.

(2) The stochastic variation term. For x ∈ ∆_k, we have p̂_{s,a}(x) = p̂_k/∆ with p̂_k = (1/n) Σ_{i=1}^n 1(X_i ∈ ∆_k). Thus

P(sup_x |p̂_{s,a}(x) - E(p̂_{s,a}(x))| > ε)
= P(max_{j=1,...,N} |(1/n) Σ_{i=1}^n 1(X_i ∈ ∆_j)/∆ - P(X_i ∈ ∆_j)/∆| > ε)
= P(max_{j=1,...,N} |(1/n) Σ_{i=1}^n 1(X_i ∈ ∆_j) - P(X_i ∈ ∆_j)| > ε∆)
≤ Σ_{j=1}^N P(|(1/n) Σ_{i=1}^n 1(X_i ∈ ∆_j) - P(X_i ∈ ∆_j)| > ε∆)
≤ 2N exp(-2nε²∆²) (by Hoeffding's inequality),

where the last inequality uses the fact that the indicator function is bounded in [0, 1]. Setting the right-hand side equal to a constant independent of N, n and ∆ and solving for ε yields sup_x |p̂_{s,a}(x) - E(p̂_{s,a}(x))| = O_P(√(log N / (n∆²))). In summary, since the above bounds hold for every x, we obtain the uniform convergence rate of the histogram density estimator stated in Theorem 1.

D PROOF OF PROPOSITION 2

Proof. For the histogram density estimator h_θ and the true target density function p(x), we can simplify the KL divergence as follows.
D_KL(h, h_θ) = Σ_{i=1}^N ∫_{z_{i-1}}^{z_i} (p_i/∆) log((p_i/∆)/(h^i_θ/∆)) dx = Σ_{i=1}^N ∫_{z_{i-1}}^{z_i} (p_i/∆) log(p_i/∆) dx - Σ_{i=1}^N ∫_{z_{i-1}}^{z_i} (p_i/∆) log(h^i_θ/∆) dx ∝ -Σ_{i=1}^N ∫_{z_{i-1}}^{z_i} (p_i/∆) log(h^i_θ/∆) dx = -Σ_{i=1}^N p_i log(h^i_θ/∆) ∝ -Σ_{i=1}^N p_i log h^i_θ, (21)

where h^i_θ is determined by i and θ and is independent of x. For a categorical distribution estimator c_θ with probability c^i_θ on each atom z_i, and its target categorical distribution c with probabilities p_i, we have

D_KL(c, c_θ) = Σ_{i=1}^N p_i log(p_i / c^i_θ) = Σ_{i=1}^N p_i log p_i - Σ_{i=1}^N p_i log c^i_θ ∝ -Σ_{i=1}^N p_i log c^i_θ. (22)

Categorical distributional RL only uses a discrete categorical distribution with probabilities centered on the fixed atoms {z_i}_{i=1}^N, while the histogram density estimator in our analysis is a continuous function defined on [z_0, z_N]. We conclude that minimizing the KL divergence for the parameterized categorical distribution in Eq. 22 is equivalent to minimizing the cross entropy loss for the parameterized histogram function in Eq. 21.
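The "∝" steps above amount to the identity D_KL(c, c_θ) = H(c, c_θ) - H(c), where the entropy H(c) is independent of θ, so both objectives share the same minimizer in θ. A quick numerical check (probabilities chosen arbitrarily for illustration):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # target categorical probabilities p_i
q = np.array([0.3, 0.4, 0.3])   # parameterized categorical c_theta

kl = np.sum(p * np.log(p / q))    # D_KL(c, c_theta)
ce = -np.sum(p * np.log(q))       # cross entropy H(c, c_theta)
ent = -np.sum(p * np.log(p))      # entropy H(c), a theta-independent constant

# KL equals cross entropy minus a constant in theta, so minimizing
# either objective over theta gives the same minimizer.
```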

E PROPERTIES OF KL DIVERGENCE IN DISTRIBUTIONAL RL

Proposition 6. Given two probability measures µ and ν, define the supremal KL divergence as a functional P(X)^{S×A} × P(X)^{S×A} → R, i.e., D^∞_KL(µ, ν) = sup_{(s,a)∈S×A} D_KL(µ(s, a), ν(s, a)). Then: (1) T^π is a non-expansive distributional Bellman operator under D^∞_KL, i.e., D^∞_KL(T^π Z_1, T^π Z_2) ≤ D^∞_KL(Z_1, Z_2); (2) D^∞_KL(Z_n, Z) → 0 implies the Wasserstein distance W_p(Z_n, Z) → 0; (3) the expectation of Z^π is still γ-contractive under D^∞_KL, i.e., ||E T^π Z_1 - E T^π Z_2||_∞ ≤ γ ||E Z_1 - E Z_2||_∞.

Proof. We first assume Z_θ is absolutely continuous and that the supports of the two distributions in the KL divergence have a non-negligible intersection (Arjovsky & Bottou, 2017), under which the KL divergence is well-defined. (1) Please refer to (Morimura et al., 2012) for the proof. Therefore, we have D^∞_KL(T^π Z_1, T^π Z_2) ≤ D^∞_KL(Z_1, Z_2), implying that T^π is a non-expansive operator under D^∞_KL. (2) By the definition of D^∞_KL, sup_{s,a} D_KL(Z_n(s,a), Z(s,a)) → 0 implies D_KL(Z_n, Z) → 0, which in turn implies the total variation distance δ(Z_n, Z) → 0 by a straightforward application of Pinsker's inequality: δ(Z_n, Z) ≤ √(D_KL(Z_n, Z)/2) → 0 and δ(Z, Z_n) ≤ √(D_KL(Z, Z_n)/2) → 0. By Theorem 2 in WGAN (Arjovsky et al., 2017), δ(Z_n, Z) → 0 implies W_p(Z_n, Z) → 0; this follows from the fact that δ and W induce the strong and weak topologies, respectively, on the dual of (C(X), ||·||_∞) when restricted to Prob(X). (3) The conclusion holds because E T^π degenerates to T^π regardless of the metric d_p (Bellemare et al., 2017a). Specifically, by the linearity of expectation, we obtain ||E T^π Z_1 - E T^π Z_2||_∞ = ||T^π E Z_1 - T^π E Z_2||_∞ ≤ γ ||E Z_1 - E Z_2||_∞. This implies that the expectation of Z under D^∞_KL converges exponentially to the expectation of Z*, i.e., γ-contraction.
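Step (2) hinges on Pinsker's inequality, δ(p, q) ≤ √(D_KL(p, q)/2). A small randomized check over discrete distributions (Dirichlet draws chosen by us purely for illustration):

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.sum(np.abs(p - q))

def kl_div(p, q):
    """KL divergence for strictly positive discrete distributions."""
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
violations = 0
for _ in range(100):
    p = rng.dirichlet(np.ones(5))   # strictly positive, so KL is finite
    q = rng.dirichlet(np.ones(5))
    # Pinsker: delta(p, q) <= sqrt(D_KL(p, q) / 2).
    if total_variation(p, q) > np.sqrt(kl_div(p, q) / 2) + 1e-12:
        violations += 1
```

Since the bound forces δ(Z_n, Z) → 0 whenever D_KL(Z_n, Z) → 0, the argument in the proof goes through.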

F PROOF OF PROPOSITION 3

Proposition 3. Denote q^{s,a}_θ(x) as the histogram density function of Z^k_θ(s, a) in Neural FZI. Based on the decomposition in Eq. 3 and with the KL divergence as d_p, Neural FZI in Eq. 2 simplifies to

Z^{k+1}_θ = argmin_{q_θ} (1/n) Σ_{i=1}^n [-log q^{s_i,a_i}_θ(∆^i_E) + α H(µ^{s_i,π_Z(s_i)}, q^{s_i,a_i}_θ)].

Proof. Firstly, for a fixed p(x), minimizing D_KL(p, q_θ) is equivalent to minimizing the cross entropy H(p, q_θ), since

D_KL(p, q_θ) = Σ_{i=1}^N ∫_{z_{i-1}}^{z_i} (p_i/∆) log((p_i/∆)/(q^i_θ/∆)) dx = -Σ_{i=1}^N ∫_{z_{i-1}}^{z_i} (p_i/∆) log(q^i_θ/∆) dx + Σ_{i=1}^N ∫_{z_{i-1}}^{z_i} (p_i/∆) log(p_i/∆) dx = H(p, q_θ) - H(p) ∝ H(p, q_θ).

Based on H(p, q_θ), we use p^{s_i,π_Z(s_i)}(x) to denote the target probability density function of the random variable R(s_i, a_i) + γ Z^k_{θ*}(s_i, π_Z(s_i)). Applying the decomposition in Eq. 3 to this target density then yields the objective function above, where the cross entropy H(µ^{s_i,π_Z(s_i)}, q^{s_i,a_i}_θ) is based on the discrete distribution over the bins i = 1, ..., N, and ∆^i_E denotes the interval that E[Z^π(s_i, π_Z(s_i))] falls into, i.e., E[Z^π(s_i, π_Z(s_i))] ∈ ∆^i_E.

G PROOF OF PROPOSITION 4

Proposition 4. In Eq. 2 of Neural FZI, if the function class {Z_θ : θ ∈ Θ} is sufficiently large such that it contains {Y_i}_{i=1}^n, then as ∆ → 0 (N → +∞), we have P(Z^{k+1}_θ(s, a) = T_opt Q^k_{θ*}(s, a)) = 1, where T_opt Q^k_{θ*}(s, a) is the target in Eq. 1 of Neural FQI.

Proof. Firstly, we define the distributional Bellman optimality operator T_opt as follows:

T_opt Z(s, a) =_D R(s, a) + γ Z(S', a*), where S' ∼ P(·|s, a) and a* = argmax_{a'} E[Z(S', a')].

If {Z_θ : θ ∈ Θ} is sufficiently large such that it contains T_opt Z_{θ*}, then optimizing Neural FZI in Eq. 2 leads to Z^{k+1}_θ = T_opt Z_{θ*}. We apply the action-value density function decomposition to the target histogram function p_{s,a}(x). Consider the parameterized histogram density function h_θ and denote h^E_θ/∆ as the bin height in the bin ∆_E. Under the KL divergence between the first histogram function 1(x ∈ ∆_E)/∆ and h_θ(x), the objective function simplifies to

D_KL(1(x ∈ ∆_E)/∆, h_θ(x)) ∝ -∫_{x∈∆_E} (1/∆) log(h^E_θ/∆) dx ∝ -log h^E_θ.

Since {Z_θ : θ ∈ Θ} is sufficiently large, the KL minimizer is h_θ = 1(x ∈ ∆_E)/∆ in expectation. Then argmin_{h_θ} lim_{∆→0} D_KL(1(x ∈ ∆_E)/∆, h_θ(x)) = δ_{E[Z_target(s,a)]}.

H.1 POLICY EVALUATION WITH PROOF

The distribution-entropy-regularized Bellman operator T^π_d in Eq. 8 can be written in terms of the entropy-augmented reward r^π(s_t, a_t) = r(s_t, a_t) + γ f(H(µ^{s_t,a_t}, q^{s_t,a_t}_θ)) that we redefine. Applying the standard convergence results for policy evaluation (Sutton & Barto, 2018), this Bellman update under T^π_d is convergent under the assumptions of |A| < ∞ and bounded entropy-augmented rewards r^π.

H.2 POLICY IMPROVEMENT WITH PROOF

Lemma 2. (Distribution-Entropy-Regularized Policy Improvement) Let π_old ∈ Π and let a new policy π_new be updated via the policy improvement step in Eq. 11. Then Q^{π_new}(s_t, a_t) ≥ Q^{π_old}(s_t, a_t) for all (s_t, a_t) ∈ S × A with |A| < ∞.

Proof. The policy improvement step in Lemma 2 implies that E_{a_t∼π_new}[Q^{π_old}(s_t, a_t)] ≥ E_{a_t∼π_old}[Q^{π_old}(s_t, a_t)]. Consider the Bellman equation via the distribution-entropy-regularized Bellman operator T^π_sd:

Q^{π_old}(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼ρ}[V^{π_old}(s_{t+1})]
= r(s_t, a_t) + γ f(H(µ^{s_t,a_t}, q^{s_t,a_t}_θ)) + γ E_{(s_{t+1},a_{t+1})∼ρ^{π_old}}[Q^{π_old}(s_{t+1}, a_{t+1})]
≤ r(s_t, a_t) + γ f(H(µ^{s_t,a_t}, q^{s_t,a_t}_θ)) + γ E_{(s_{t+1},a_{t+1})∼ρ^{π_new}}[Q^{π_old}(s_{t+1}, a_{t+1})]
= r^{π_new}(s_t, a_t) + γ E_{(s_{t+1},a_{t+1})∼ρ^{π_new}}[Q^{π_old}(s_{t+1}, a_{t+1})]
...
≤ Q^{π_new}(s_t, a_t),

where we have repeatedly expanded Q^{π_old} on the right-hand side by applying the distribution-entropy-regularized Bellman operator. Convergence to Q^{π_new} follows from Lemma 1.

For distributional SAC with quantile regression, instead of the fixed quantiles in QR-DQN, we leverage the quantile fraction generation of IQN (Dabney et al., 2018a), which uniformly samples quantile fractions in order to approximate the full quantile function. In particular, we fix the number of quantile fractions at N and keep them in ascending order, setting τ_0 = 0 and obtaining the remaining τ_i by normalizing cumulative sums of i.i.d. draws ε_i ∼ U[0, 1], i = 1, ..., N.

I.1 HYPER-PARAMETERS AND NETWORK STRUCTURE

We adopt the same hyper-parameters, listed in Table 1, and the same network structure as in the original distributional SAC paper (Ma et al., 2020).
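The quantile-fraction sampling scheme can be sketched as follows. This is one plausible reading of the sampling rule described above (normalized cumulative sums of uniform draws); the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_quantile_fractions(n):
    """Sample n + 1 ascending quantile fractions with tau_0 = 0 and
    tau_n = 1 by normalizing cumulative sums of U[0, 1] draws."""
    eps = rng.uniform(0.0, 1.0, size=n)
    taus = np.concatenate([[0.0], np.cumsum(eps) / np.sum(eps)])
    return taus

taus = sample_quantile_fractions(32)
# taus is ascending, starts at 0, ends at 1, and has 33 entries.
```

Unlike the fixed quantiles of QR-DQN, resampling the fractions at every update lets the critic cover the full quantile function in expectation.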

J DERAC ALGORITHM

Algorithm 1 Distribution-Entropy-Regularized Actor Critic (DERAC)
1: Initialize two value networks q_θ, q_θ*, and policy network π_φ.
2: for each iteration do
3:   for each environment step do
4:     a_t ∼ π_φ(a_t|s_t).
5:     s_{t+1} ∼ p(s_{t+1}|s_t, a_t).
6:     D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}.

Figure 6 shows that DERAC with different λ in Eq. 12 may behave differently across environments. Learning curves of DERAC with an increasing λ tend toward DSAC (C51), e.g., on BipedalWalkerHardcore.



Figure 3: Learning curves of DERAC algorithms on three MuJoCo environments over 5 seeds.

sup_x |p̂_{s,a}(x) - p_{s,a}(x)| ≤ sup_x |E(p̂_{s,a}(x)) - p_{s,a}(x)| + sup_x |p̂_{s,a}(x) - E(p̂_{s,a}(x))| = O(∆) + O_P(√(log N/(n∆²))).

Proposition 2. Suppose the target categorical distribution is c = Σ_{i=1}^N p_i δ_{z_i} and the target histogram function is h(x) = Σ_{i=1}^N p_i 1(x ∈ ∆_i)/∆. Then updating the parameterized categorical distribution c_θ under the KL divergence is equivalent to updating the parameterized histogram function h_θ.
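The uniform convergence rate restated above can be illustrated with a small simulation: for a density with zero bias (uniform on [0, 1]), the sup-norm error of the histogram estimator shrinks as n grows with ∆ shrinking accordingly. The setup (uniform target, bin counts) is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def hist_sup_error(n, n_bins):
    """Sup-norm error of a histogram density estimate of Uniform(0, 1),
    whose true density is identically 1 on the support."""
    x = rng.uniform(0.0, 1.0, size=n)
    counts, _ = np.histogram(x, bins=n_bins, range=(0.0, 1.0))
    density = counts / (n * (1.0 / n_bins))   # hat p_i / Delta
    return np.max(np.abs(density - 1.0))

# Growing n with Delta shrinking (more bins): the stochastic variation
# term O_P(sqrt(log N / (n Delta^2))) predicts a smaller sup error.
small_n_error = hist_sup_error(1_000, 10)
large_n_error = hist_sup_error(1_000_000, 100)
```

For the uniform target the bias term vanishes, so the decay seen here is driven purely by the stochastic variation term of the bound.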

where δ_{E[Z_target(s,a)]} is a Dirac delta function centered at E[Z_target(s, a)] and can be viewed as a generalized probability density function. The limiting behavior from a histogram function p̂ to a continuous one for Z_target is guaranteed by Theorem 1, and this also applies from h_θ to Z_θ. In Neural FZI, we have Z_target = T_opt Z_{θ*}. According to the definition of the Dirac function, as ∆ → 0, we attain P(Z^{k+1}_θ(s, a) = E[T_opt Z^k_{θ*}(s, a)]) = 1.

9: θ ← θ - λ_q ∇_θ Ĵ_q(θ).
10: φ ← φ + λ_π ∇_φ Ĵ_π(φ).

Figure 5 suggests that C51 with the cross entropy loss behaves similarly to the vanilla C51 equipped with the KL divergence.

We find that the histogram density estimate plays an intermediate role between these two branches. (1) Connection to categorical distributional RL. Although the continuous histogram density estimator contrasts with the discrete categorical distribution, in Proposition 2 (proof in Appendix D) we reveal that minimizing the KL divergence between the target density function and the histogram density estimate is equivalent to minimizing it for the parameterized categorical distribution. (2) Connection to quantile-based distributional RL. Histogram and quantile functions are "two sides of a coin": the histogram estimates the density function by assigning each equal-width bin its own probability mass, while the quantile function assigns each equal fraction of probability its own support interval. Based on this insight, we argue that our analysis based on the histogram density estimate is largely general and representative across distributional RL families.
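The "two sides of a coin" duality can be made concrete with a small numpy sketch: the histogram fixes bin widths and lets the mass per bin vary, while the quantile view fixes the mass per bin and lets the widths vary. Variable names and the Gaussian return sample are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=10_000)   # samples of the return Z(s, a)

# Histogram view: fixed-width bins, varying probability mass per bin.
counts, edges = np.histogram(z, bins=10, range=(-4.0, 4.0))
masses = counts / counts.sum()          # unequal masses, equal widths

# Quantile view: fixed probability mass (1/10 each), varying bin width.
quantiles = np.quantile(z, np.linspace(0.0, 1.0, 11))
widths = np.diff(quantiles)             # equal masses, unequal widths
```

Both views describe the same empirical return distribution; they differ only in which quantity (width or mass) is held fixed per bin.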

Figure 4: Learning curves of AC, AC+VE, AC+RE and AC+RE+VE over 5 seeds with smoothing window 5 across eight MuJoCo environments, where the distributional RL part is based on IQN.


Ethics Statement. As our study reveals the regularization effect of distributional RL algorithms, it does not involve any ethics issues in our opinion.

Reproducibility Statement. Our results are based on the public implementation released in (Ma et al., 2020), with the necessary implementation details given in Appendix I. We also provide detailed proofs in the Appendix.

H.3 PROOF OF SOFT DISTRIBUTIONAL POLICY ITERATION IN THEOREM 2

Theorem 2. (Distribution-Entropy-Regularized Policy Iteration) Assume H(µ^{s_t,a_t}, q^{s_t,a_t}_θ) ≤ M for all (s_t, a_t) ∈ S × A, where M is a constant. Repeatedly applying distribution-entropy-regularized policy evaluation in Eq. 8 and the policy improvement in Eq. 11, the policy converges to an optimal policy π* such that Q^{π*}(s_t, a_t) ≥ Q^π(s_t, a_t) for all π ∈ Π.

Proof. The proof is similar to soft policy iteration (Haarnoja et al., 2018); for completeness, we provide it here. By Lemma 2, the sequence Q^{π_i} at the i-th iteration is monotonically increasing in i. Since we assume the risk-sensitive entropy is bounded by M and the rewards are bounded, Q^π is bounded, so the sequence converges to some π*. Further, we prove that π* is in fact optimal. At the convergence point, for all π ∈ Π, the policy improvement objective of π* must be no smaller than that of π. Following the argument in the proof of Lemma 2, we then attain Q^{π*}(s_t, a_t) ≥ Q^π(s_t, a_t) for all (s_t, a_t). That is, the "corrected" value function of any other policy in Π is lower than that of the converged policy, indicating that π* is optimal.

I IMPLEMENTATION DETAILS

Our implementation is directly adapted from the source code of (Ma et al., 2020). For distributional SAC with C51, we use 51 atoms as in C51 (Bellemare et al., 2017a). For distributional SAC with quantile regression, instead of the fixed quantiles in QR-DQN, we uniformly sample quantile fractions following IQN (Dabney et al., 2018a).

On BipedalWalkerHardcore, DERAC with λ = 1 (green line) tends toward DSAC (C51) (blue line). However, DERAC with a small λ is likely to outperform DSAC (C51) by leveraging only the expectation of the value distribution, e.g., on BipedalWalkerHardcore, where DERAC with λ = 0 and 0.5 surpasses DERAC with λ = 1.0.

