HOW DOES VALUE DISTRIBUTION IN DISTRIBUTIONAL REINFORCEMENT LEARNING HELP OPTIMIZATION?

Abstract

We consider the problem of learning a set of probability distributions from the Bellman dynamics in distributional reinforcement learning (RL), which learns the whole return distribution rather than only its expectation as in classical RL. Despite its superior empirical performance, we still have a poor understanding of how the value distribution in distributional RL works. In this study, we analyze the optimization benefits of distributional RL by leveraging its additional value distribution information over classical RL in the Neural Fitted Z-Iteration (Neural FZI) framework. To begin with, we demonstrate that the distribution loss of distributional RL has desirable smoothness characteristics and hence enjoys stable gradients, which is in line with its tendency to promote optimization stability. Furthermore, the acceleration effect of distributional RL is revealed by decomposing the return distribution. It turns out that distributional RL can perform favorably if the value distribution approximation is appropriate, as measured by the variance of gradient estimates in each environment for any specific distributional RL algorithm. Rigorous experiments validate the stable optimization behaviors of distributional RL, which contribute to its acceleration effects compared to classical RL. The findings of our research illuminate how the value distribution in distributional RL algorithms helps the optimization.

1. INTRODUCTION

Distributional reinforcement learning (Bellemare et al., 2017a; Dabney et al., 2018b;a; Yang et al., 2019; Zhou et al., 2020; Nguyen et al., 2020; Luo et al., 2021; Sun et al., 2022) characterizes the intrinsic randomness of returns within the framework of Reinforcement Learning (RL). When the agent interacts with the environment, the intrinsic uncertainty of the environment seeps into the stochasticity of the rewards the agent receives and into the inherently chaotic state and action dynamics of physical interaction, increasing the difficulty of RL algorithm design. Distributional RL aims to represent the entire distribution of returns in order to capture more of the intrinsic uncertainty of the environment, and then uses these value distributions to evaluate and optimize the policy. This is in stark contrast to classical RL, such as temporal-difference (TD) learning (Sutton & Barto, 2018) and Q-learning (Watkins & Dayan, 1992), which focuses only on the expectation of the return distributions. As a promising branch of RL algorithms, distributional RL has demonstrated state-of-the-art performance in a wide range of environments, e.g., Atari games, in which the representation of return distributions and the distribution divergence between the current and target return distributions within each Bellman update are pivotal to its empirical success (Dabney et al., 2018a; Sun et al., 2021b; 2022). Specifically, categorical distributional RL, e.g., C51 (Bellemare et al., 2017a; Rowland et al., 2018), combines a categorical distribution, which approximates the probabilities in pre-specified bins over a bounded range, with the Kullback-Leibler (KL) divergence, serving as the first successful distributional RL family in recent years. Quantile Regression (QR) distributional RL, e.g., QR-DQN (Dabney et al., 2018b), approximates the Wasserstein distance by the quantile regression loss and leverages quantiles to represent the whole return distribution.
Other variants of QR-DQN, including Implicit Quantile Networks (IQN) (Dabney et al., 2018a) and Fully parameterized Quantile Function (FQF) (Yang et al., 2019), can achieve significantly better performance across many Atari games. Moment Matching distributional RL (Nguyen et al., 2020) learns deterministic samples to evaluate the distribution distance based on Maximum Mean Discrepancy, while a more recent work called Sinkhorn distributional RL (Sun et al., 2022) interpolates between Maximum Mean Discrepancy and Wasserstein distance via Sinkhorn divergence (Sinkhorn, 1967). Meanwhile, distributional RL also brings benefits in risk-sensitive control (Dabney et al., 2018a), policy exploration settings (Mavrin et al., 2019; Rowland et al., 2019) and robustness (Sun et al., 2021a). Despite the remarkable empirical success of distributional RL, its theoretical advantages are still understudied. A distributional regularization effect (Sun et al., 2021b) stemming from the additional value distribution knowledge has been characterized to explain the superiority of distributional RL over classical RL, but the benefit of this regularization for the optimization of algorithms has not been investigated, even though optimization plays a key role in RL algorithms. In the literature on strategies that help learning in RL, recent progress mainly focuses on policy gradient methods (Sutton & Barto, 2018). Mei et al. (2020) show that the policy gradient with a softmax parameterization converges at an O(1/t) rate, with constants depending on the problem and initialization, which significantly expands the existing asymptotic convergence results. Entropy regularization (Haarnoja et al., 2017; 2018) has gained increasing attention as it can significantly speed up policy optimization with a faster linear convergence rate (Mei et al., 2020). Ahmed et al. (2019) provide a fine-grained understanding of the impact of entropy on policy optimization, and emphasize that any strategy, such as entropy regularization, can only affect learning in one of two ways: either it reduces the noise in the gradient estimates or it changes the optimization landscape. These commonly used strategies that accelerate RL learning inspire us to further investigate the optimization impact of distributional RL arising from the exploitation of return distributions. In this paper, we study the theoretical superiority of distributional RL over classical RL from the optimization standpoint. We begin by analyzing the optimization impact of different strategies within the Neural Fitted Z-Iteration (Neural FZI) framework and point out two crucial factors that contribute to the optimization of distributional RL: the distribution divergence and the distribution parameterization error. We also reveal the smoothness property of the distributional RL loss function under the categorical parameterization, which yields its stable optimization behavior. Uniform stability in the optimization process can thus be achieved more easily for distributional RL than for classical RL. In addition to the optimization stability, we also elaborate on the acceleration effect of distributional RL algorithms based on the recently proposed value distribution decomposition technique. It turns out that distributional RL can be shown to speed up the convergence and perform favorably if the value distribution is approximated appropriately, which is measured by the variance of gradient estimates. Empirical results corroborate that distributional RL indeed enjoys a stable gradient behavior: we observe smaller gradient norms with respect to the observations the agent encounters in the learning process.
Besides, the variance reduction of gradient estimates for distributional RL algorithms with respect to network parameters also provides strong evidence to demonstrate the smoothness property and acceleration effects of distributional RL. Our study opens up many exciting research pathways in this domain through the lens of optimization, paving the way for future investigations to reveal more advantages of distributional RL.

2. PRELIMINARY KNOWLEDGE

Classical RL. In a standard RL setting, the interaction between an agent and the environment is modeled as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $P$ is the transition kernel, $R$ is the reward measure and $\gamma \in (0, 1)$ is the discount factor. For a fixed policy $\pi$, the return $Z^\pi = \sum_{t=0}^{\infty} \gamma^t R_t$ is a random variable representing the sum of discounted rewards observed along one trajectory of states while following the policy $\pi$. Classical RL focuses on the value function and action-value function, i.e., the expectation of the return $Z^\pi$. The action-value function $Q^\pi(s, a)$ is defined as $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$, where $s_0 = s$, $a_0 = a$, $s_{t+1} \sim P(\cdot|s_t, a_t)$, and $a_t \sim \pi(\cdot|s_t)$.

Distributional RL. Distributional RL, on the other hand, focuses on the action-value distribution, i.e., the full distribution of $Z^\pi(s, a)$ rather than only its expectation $Q^\pi(s, a)$. Leveraging knowledge of the entire value distribution can better capture the uncertainty of returns and can thus be advantageous for exploring the intrinsic uncertainty of the environment (Dabney et al., 2018a; Mavrin et al., 2019). The scalar-based classical Bellman update is therefore extended to the distributional Bellman update, which has given rise to a flurry of distributional RL algorithms, mainly including Categorical distributional RL (Bellemare et al., 2017a) and Quantile Regression distributional RL (Dabney et al., 2018b;a).
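To make the contrast between the return $Z^\pi$ and its expectation $Q^\pi$ concrete, the following is a minimal sketch, not part of the original method. It assumes a hypothetical toy reward process (i.i.d. rewards in {0, 1}) in place of a real MDP rollout, and the function names are illustrative only. Classical RL would keep the single number returned by `q_estimate`, whereas distributional RL models the whole list of sampled returns.

```python
import random
import statistics


def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def sample_returns(num_episodes, horizon, gamma, seed=0):
    """Draw returns from a toy process with i.i.d. rewards in {0, 1}.

    The reward process is hypothetical and stands in for rolling out a
    fixed policy in a real MDP.
    """
    rng = random.Random(seed)
    returns = []
    for _ in range(num_episodes):
        rewards = [rng.choice([0.0, 1.0]) for _ in range(horizon)]
        returns.append(discounted_return(rewards, gamma))
    return returns


def q_estimate(returns):
    """Classical RL keeps only this scalar: the mean of the returns."""
    return statistics.fmean(returns)


if __name__ == "__main__":
    zs = sample_returns(num_episodes=1000, horizon=50, gamma=0.9)
    print("Q estimate (mean of Z):", q_estimate(zs))
    print("spread of Z (std dev):", statistics.pstdev(zs))
```

The truncation to a finite horizon is a simplification; the definition in the text sums over an infinite trajectory.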

Categorical Distributional RL.

As the first successful distributional RL family, Categorical distributional RL (Bellemare et al., 2017a) approximates the action-value distribution $\eta$ by a categorical distribution $\hat{\eta} = \sum_{i=1}^{k} f_i \delta_{l_i}$, where $l_1, l_2, \ldots, l_k$ is a set of fixed supports and $\{f_i\}_{i=1}^{k}$ are learnable probabilities, normally parameterized by a neural network. A projection is also introduced so that the redistributed target probabilities share a joint support with the current distribution, and a KL divergence is used to compute the distance between the current and target value distributions within each Bellman update. In practice, C51 (Bellemare et al., 2017a), an instance of Categorical distributional RL with $k = 51$, performs favorably on a wide range of environments.

Quantile Regression (QR) Distributional RL.

QR distributional RL (Dabney et al., 2018b;a) approximates the value distribution $\eta$ by a mixture of Diracs $\hat{\eta} = \frac{1}{N} \sum_{i=1}^{N} \delta_{\tau_i}$, where $\tau_i = F_\eta^{-1}\left(\frac{2i-1}{2N}\right)$ are the learnable quantile values at the fixed quantiles $\left\{\frac{2i-1}{2N}\right\}$ and $F_\eta^{-1}$ is the inverse cumulative distribution function of $\eta$. Since the quantile regression loss proposed in QR distributional RL can be used to approximate the Wasserstein distance, it achieves favorable performance on Atari games. Moreover, the performance has been further improved by a series of variants based on the quantile regression loss (Dabney et al., 2018a; Yang et al., 2019; Zhou et al., 2020). For example, Implicit Quantile Network (IQN) (Dabney et al., 2018a) utilizes a continuous mapping for the quantile function $F_\eta^{-1}\left(\frac{2i-1}{2N}\right)$ rather than a fixed set of quantiles, which expands the expressive power of function approximators to represent the value distribution.
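The two parameterizations above can be illustrated with a small sketch. This is a hedged illustration and not the C51 or QR-DQN implementation: the helper names are hypothetical, the projection step of C51 is omitted, and `pinball_loss` is the plain quantile regression loss $\rho_\tau(u) = u(\tau - \mathbb{1}\{u < 0\})$ rather than any smoothed variant.

```python
def categorical_mean(probs, atoms):
    """Expectation of the categorical distribution sum_i f_i * delta_{l_i}."""
    return sum(p * l for p, l in zip(probs, atoms))


def quantile_midpoints(n):
    """Fixed quantile levels (2i - 1) / (2N) for i = 1, ..., N."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]


def quantile_mean(quantile_values):
    """Expectation of the uniform mixture of Diracs (1/N) sum_i delta_{tau_i}."""
    return sum(quantile_values) / len(quantile_values)


def pinball_loss(pred, target, tau):
    """Quantile regression loss of a predicted quantile value at level tau."""
    u = target - pred
    indicator = 1.0 if u < 0 else 0.0
    return u * (tau - indicator)


if __name__ == "__main__":
    atoms = [0.0, 5.0, 10.0]          # fixed supports l_1, ..., l_k
    probs = [0.2, 0.5, 0.3]           # learnable probabilities f_i
    print("categorical mean:", categorical_mean(probs, atoms))
    print("quantile levels for N=4:", quantile_midpoints(4))
    print("pinball loss:", pinball_loss(pred=0.0, target=1.0, tau=0.25))
```

In the categorical case the locations are fixed and the probabilities are learned; in the quantile case the probabilities are fixed at $1/N$ and the locations are learned.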

3. OPTIMIZATION EFFECT OF DISTRIBUTIONAL RL

We consider the function approximation setting to analyze the optimization benefit of distributional RL. In Section 3.1, we begin by showing the different roles of components in distributional RL on the entire optimization of RL algorithms within the Neural FZI framework. Further, in Section 3.2 we reveal the desirable smoothness properties of distributional RL loss function as opposed to classical RL, contributing to the stable optimization. Finally, the acceleration effect of distributional RL stemming from the additional value distribution is analyzed in Section 3.3, which is characterized by the variance of gradient estimates.

3.1. HOW TO OPTIMIZE NEURAL FITTED Z-ITERATION FOR DISTRIBUTIONAL RL?

In classical RL, Neural Fitted Q-Iteration (Neural FQI) (Fan et al., 2020; Riedmiller, 2005) provides a statistical interpretation of DQN (Mnih et al., 2015) while capturing its two key features, i.e., the use of a target network and experience replay:

$$Q_\theta^{k+1} = \operatorname{argmin}_{Q_\theta} \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - Q_\theta^k(s_i, a_i) \right]^2, \tag{1}$$

where the target $y_i = r(s_i, a_i) + \gamma \max_{a' \in \mathcal{A}} Q_{\theta^*}^k(s_i', a')$ is fixed within every $T_{\text{target}}$ steps, after which the target network $Q_{\theta^*}$ is updated by letting $\theta^* = \theta$. The experience buffer induces independent samples $\{(s_i, a_i, r_i, s_i')\}_{i \in [n]}$ and, ideally without the optimization and TD approximation errors, Neural FQI is exactly the update under the Bellman optimality operator (Fan et al., 2020). Similarly, in distributional RL, Sun et al. (2021b) and Ma et al. (2021) proposed Neural Fitted Z-Iteration (Neural FZI), a distributional version of Neural FQI based on the parameterization of $Z_\theta$:

$$Z_\theta^{k+1} = \operatorname{argmin}_{Z_\theta} \frac{1}{n} \sum_{i=1}^{n} d_p\left(Y_i, Z_\theta^k(s_i, a_i)\right), \tag{2}$$

where the target $Y_i = R(s_i, a_i) + \gamma Z_{\theta^*}^k(s_i', \pi_Z(s_i'))$ is a random variable, whose distribution is also fixed within every $T_{\text{target}}$ steps. The target follows a greedy policy rule, where $\pi_Z(s_i') = \operatorname{argmax}_{a'} \mathbb{E}\left[Z_{\theta^*}^k(s_i', a')\right]$, and $d_p$ is the chosen distribution distance. Within the Neural FZI process, there are two crucial components that determine the overall optimization of distributional RL algorithms.

• The choice of distribution divergence $d_p$. [...] et al., 2017a). Apart from its impact on the speed of the distributional Bellman update, $d_p$ also largely affects the continuous optimization problem of estimating the parameter $\theta$ in $Z_\theta$ within each iteration of Neural FZI, including the convergence speed and whether bad or good local minima are reached.

• The parameterization manner of $Z_\theta$. The way the value distribution is represented plays an integral part in the optimization of deep RL algorithms. For example, with more expressive power on quantile functions, IQN outperforms QR-DQN on a wider range of Atari games, which is intuitive as a more informative representation can approximate the true value distribution more faithfully. A smaller value distribution parameterization error also has the potential to help the optimization, albeit indirectly.

Because the convergence rates of the distributional Bellman update under typical $d_p$ are basically known, our optimization analysis mainly focuses on the impact of $d_p$ and the parameterization error of $Z_\theta$ on the continuous optimization within Neural FZI of distributional RL, in comparison with Neural FQI of classical RL. In Sections 3.2 and 3.3, we attribute the optimization benefits of distributional RL to its distribution objective function, which consists of the aforementioned two factors, as opposed to the vanilla least squares loss in classical RL.
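The difference between the scalar target of Neural FQI and the random-variable target of Neural FZI can be sketched for a single transition. This is a simplified illustration under stated assumptions: the reward is deterministic, the next-state return distributions are categorical over shared atoms, no projection back onto the fixed support is applied, and all function names are hypothetical.

```python
def fqi_target(reward, gamma, next_q_values):
    """Scalar target y = r + gamma * max_a' Q(s', a') used by Neural FQI."""
    return reward + gamma * max(next_q_values)


def categorical_mean(probs, atoms):
    """Expectation of a categorical distribution over the given atoms."""
    return sum(p * l for p, l in zip(probs, atoms))


def greedy_action(next_dists, atoms):
    """pi_Z(s') = argmax_a' E[Z(s', a')], computed from the distribution means."""
    means = [categorical_mean(probs, atoms) for probs in next_dists]
    return max(range(len(means)), key=lambda a: means[a])


def fzi_target(reward, gamma, next_dists, atoms):
    """Distributional target Y = r + gamma * Z(s', pi_Z(s')).

    Returns the shifted and scaled atoms together with the unchanged
    probabilities of the greedy action's distribution.
    """
    a_star = greedy_action(next_dists, atoms)
    target_atoms = [reward + gamma * l for l in atoms]
    return target_atoms, list(next_dists[a_star])


if __name__ == "__main__":
    atoms = [0.0, 1.0, 2.0]
    next_dists = [[1.0, 0.0, 0.0], [0.2, 0.3, 0.5]]  # one row per action a'
    target_atoms, target_probs = fzi_target(1.0, 0.9, next_dists, atoms)
    print("FZI target atoms:", target_atoms)
    print("FZI target probabilities:", target_probs)
```

Taking the expectation of the FZI target recovers the FQI target when $Q$ is defined as the mean of $Z$, which is why the two frameworks differ only in how much of the target distribution the loss sees.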

3.2. STABLE OPTIMIZATION ANALYSIS BASED ON CATEGORICAL PARAMETERIZATION

To allow for a theoretical analysis, we resort to the categorical parameterization equipped with KL divergence in categorical distributional RL (Bellemare et al., 2017a), e.g., C51, in order to investigate the stable optimization properties within each iteration of Neural FZI. Concretely, we assume $Z_\theta$ is absolutely continuous and that the current and target value distributions under KL divergence have joint supports within a bounded range (Arjovsky & Bottou, 2017), under which the KL divergence is well-defined. Note that this analysis strategy is slightly different from vanilla Categorical distributional RL, which, without the joint support assumption, introduces a projection that redistributes the probabilities of the target value distribution to neighboring bins. We slightly simplify Categorical distributional RL by assuming that the target distribution stays within the pre-specified support, which is easy to satisfy in practice given a relatively large bounded range $[l_0, l_k]$ chosen in advance. To approximate the categorical distribution, we leverage the histogram function $f^{s,a}$ with $k$ uniform partitions on the support to parameterize the approximated probability density function of $Z(s, a)$. With KL divergence as $d_p$, we can eventually derive the distribution objective function to be optimized within each update in Neural FZI, which is similar to the histogram distributional loss proposed in (Imani & White, 2018). In particular, we denote by $x(s)$ the state feature of each state $s$, and we let the support of $Z(s, a)$ be uniformly partitioned into $k$ bins. The output dimension of $f^{s,\cdot}$ can be $|\mathcal{A}| \times k$, where we use the index $a$ to focus on the function $f^{s,a}$. Hence, the function $f^{s,a}: \mathcal{X} \to [0, 1]^k$ provides a $k$-dimensional vector $f^{s,a}(x(s))$ of coefficients, each indicating the probability that the target falls in the corresponding bin given the state feature $x(s)$ and action $a$.
Next, we use a softmax based on the linear approximation $x(s)^\top \theta_i$ to express $f^{s,a}$, i.e., $f_i^{s,a,\theta}(x(s)) = \exp\left(x(s)^\top \theta_i\right) / \sum_{j=1}^{k} \exp\left(x(s)^\top \theta_j\right)$. For simplicity, we use $f_i^\theta(x(s))$ to replace $f_i^{s,a,\theta}(x(s))$. Note that the form of $f^{s,a}$ is similar to that in softmax policy gradient optimization (Mei et al., 2020; Sutton & Barto, 2018), but here we focus on value-based RL rather than policy gradient RL. Our prediction probability $f_i^{s,a}$ is redefined as the probability in the $i$-th bin over the support of $Z(s, a)$, thus eventually serving as a density function. While the linear approximator is clearly limited, this is the setting where so far the cleanest results have been achieved, and understanding this setting is a necessary first step towards the bigger problem of understanding distributional RL algorithms. Under this categorical parameterization equipped with KL divergence, the resulting distributional objective function $L_\theta(s, a)$ for the continuous optimization for each $(s, a)$ pair in each iteration of Neural FZI (Eq. 2) can be expressed as:

$$L_\theta(s, a) = -\sum_{i=1}^{k} \int_{l_i}^{l_i + w_i} p^{s,a}(y) \log \frac{f_i^\theta(x(s))}{w_i} \, dy \propto -\sum_{i=1}^{k} p_i^{s,a} \log f_i^\theta(x(s)), \tag{3}$$

where $\theta = \{\theta_1, \ldots, \theta_k\}$ and $p_i^{s,a}$ is the probability in the $i$-th bin of the true density function $p^{s,a}(x)$ for $Z(s, a)$ defined in Eq. 6. $w_i$ is the width of the $i$-th bin $(l_i, l_{i+1}]$. The derivation of the categorical distributional loss under the categorical parameterization is given in Appendix A. To establish the stable optimization property of distributional RL, we first derive the appealing properties of the new categorical distributional loss in Eq. 3, as shown in Proposition 1.

Proposition 1. (Properties of Categorical Distributional Loss) Assume the state features satisfy $\|x(s)\| \leq l$ for each state $s$. Then $L_\theta$ is $kl$-Lipschitz continuous, $kl^2$-smooth and convex w.r.t. the parameter $\theta$.

Please refer to Appendix B for the proof.
The derived smoothness properties of $d_p$ under the categorical distributional loss play an integral role in the stable optimization of distributional RL. In stark contrast, classical RL optimizes a least squares loss function (Sutton & Barto, 2018) in Neural FQI. It is known that the least squares estimator has no bounded Lipschitz constant in general and is only $\lambda_{\max}$-smooth, where $\lambda_{\max}$ is the largest singular value of the design or data matrix. More specifically, for the categorical distributional loss in distributional RL, we have $\|\nabla_\theta L_\theta\| \leq kl$, while the gradient norm in classical RL is $|y_i - Q_\theta^k(s, a)| \, \|x(s)\|$, where $Q_\theta^k(s, a) = \sum_{i=1}^{k} (l_i + l_{i+1}) f_i^\theta(x(s)) / 2w_i$ under the same categorical parameterization for a fair comparison. Clearly, $Q_\theta^k(s, a)$ can be sufficiently large if the support $[l_0, l_k]$ is specified to be large, which is common in environments where the agent is able to attain a high level of expected returns (Bellemare et al., 2017a). As such, $|y_i - Q_\theta^k(s, a)|$ can vary significantly more than $k$, and therefore classical RL, with its potentially larger upper bound on gradient norms, is prone to unstable optimization. After this intuitive comparison in terms of gradient norms, we next show that the distributional RL loss can induce a uniform stability property under the desirable smoothness properties analyzed in Proposition 1. We recap the definition of uniform stability for an algorithm while running Stochastic Gradient Descent (SGD) in Definition 1.

Definition 1. (Uniform Stability) (Hardt et al., 2016) Consider a loss function $g_w(z)$ parameterized by $w$ and evaluated on the example $z$. A randomized algorithm $\mathcal{M}$ is $\epsilon_{\text{stab}}$-uniformly stable if for all data sets $\mathcal{D}, \mathcal{D}'$ such that $\mathcal{D}, \mathcal{D}'$ differ in at most one example, we have

$$\sup_z \mathbb{E}_{\mathcal{M}}\left[g_{\mathcal{M}(\mathcal{D})}(z) - g_{\mathcal{M}(\mathcal{D}')}(z)\right] \leq \epsilon_{\text{stab}}.$$
In Theorem 1, we show that while running SGD to solve the categorical distributional loss within each Neural FZI iteration, the continuous optimization process in each iteration is $\epsilon_{\text{stab}}$-uniformly stable.

Theorem 1. (Stable Optimization for Distributional RL) Suppose that we run SGD under $L_\theta$ in Eq. 3 with step sizes $\lambda_t \leq 2/kl^2$ for $T$ steps. Assume $\|x(s)\| \leq l$ for each state $s$ and action $a$. Then $L_\theta$ satisfies the uniform stability in Definition 1 with $\epsilon_{\text{stab}} \leq \frac{4kT}{n}$, i.e.,

$$\mathbb{E}\left[L_{\theta_T}(s, a) - L_{\theta_T'}(s, a)\right] \leq \frac{4kT}{n},$$

where $\theta_T$ and $\theta_T'$ are the minimizers after $T$ steps under the datasets $\mathcal{D}$ and $\mathcal{D}'$, respectively.

Please refer to Appendix C for the proof of Theorem 1. Stable optimization has multiple advantages. In the deep learning optimization literature (Hardt et al., 2016), uniform stability guarantees an $\epsilon_{\text{stab}}$-bounded generalization gap. In reinforcement learning, algorithms with more stability tend to achieve a better final performance (Bjorck et al., 2021; Li & Pathak, 2021; Ahmed et al., 2019). In summary, under the categorical parameterization equipped with KL divergence, the continuous optimization objective function within each update of Neural FZI for distributional RL is uniformly stable with the stability errors shrinking at the rate of $O(n^{-1})$, and the resulting bounded generalization gap also guarantees a desirable local minimum. This advantage can be attributed to the desirable smoothness property of the categorical distributional loss, which has a potentially smaller upper bound on gradient norms than classical RL. Empirically, in Section 4, we validate the stable gradient behaviors of categorical distributional RL, and similar results are also observed in Quantile Regression distributional RL. By contrast, without these smoothness properties, classical RL may not yield the stable optimization property directly.
For example, $\lambda_{\max}$-smoothness may be of less help for the optimization given a bad condition number of the design matrix, where $\lambda_{\max}$ could be sufficiently large. The potential optimization instability of classical RL can be used to explain its inferiority to distributional RL in most environments, although it may not explain why distributional RL does not perform favorably in certain games (Ceron & Castro, 2021). We leave a comprehensive explanation as future work.

Remark on Non-linear Categorical Parameterization. Although the aforementioned stable optimization conclusions are established for the linear categorical parameterization of the value distribution of $Z^\pi$, similar conclusions can be extended to the non-convex optimization case with a non-linear categorical parameterization by techniques proposed in (Hardt et al., 2016). We also empirically validate our theoretical conclusions in the experiments by directly applying practical distributional RL algorithms parameterized by neural networks.
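The gradient-norm bound discussed in this section can be checked numerically on a toy instance. The sketch below is an illustration under assumptions, not the authors' code: it uses the proportional form of the loss in Eq. 3, $-\sum_i p_i \log f_i^\theta(x)$, with target probabilities summing to one, for which the gradient with respect to $\theta_i$ is $(f_i^\theta(x) - p_i)\,x$. The feature vector, parameters and target probabilities are made-up values.

```python
import math


def softmax_probs(x, thetas):
    """f_i = exp(x . theta_i) / sum_j exp(x . theta_j), computed stably."""
    logits = [sum(xj * tj for xj, tj in zip(x, theta)) for theta in thetas]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_loss(p, x, thetas):
    """Cross-entropy form of the categorical distributional loss."""
    f = softmax_probs(x, thetas)
    return -sum(pi * math.log(fi) for pi, fi in zip(p, f) if pi > 0)


def categorical_grad(p, x, thetas):
    """Gradient w.r.t. each theta_i: (f_i - p_i) * x, valid when sum(p) = 1."""
    f = softmax_probs(x, thetas)
    return [[(fi - pi) * xj for xj in x] for pi, fi in zip(p, f)]


def grad_norm(grad):
    """Euclidean norm of the gradient stacked over all theta_i."""
    return math.sqrt(sum(g * g for row in grad for g in row))


if __name__ == "__main__":
    x = [0.6, 0.8]                                   # state feature x(s)
    thetas = [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.5]]  # k = 3 bins
    p = [0.2, 0.5, 0.3]                              # target bin probabilities
    k = len(thetas)
    l = math.sqrt(sum(v * v for v in x))             # bound on ||x(s)||
    print("gradient norm:", grad_norm(categorical_grad(p, x, thetas)))
    print("bound k * l:", k * l)
```

Because $|f_i - p_i| \leq 1$ regardless of how wide the return support $[l_0, l_k]$ is, the gradient norm here depends only on the number of bins and the feature norm, whereas the squared-loss gradient scales with the residual $|y_i - Q_\theta^k(s, a)|$.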

3.3. ACCELERATION EFFECT OF DISTRIBUTIONAL RL

To characterize the acceleration effect of distributional RL, we additionally leverage the recently proposed value distribution decomposition (Sun et al., 2021b) to decompose the target $p^{s,a}$.

Value Distribution Decomposition. In order to decompose the optimization impact of the value distribution into its expectation and the remaining distribution part, we borrow from robust statistics a variant of the gross error model (Huber, 2004). Value distribution decomposition (Sun et al., 2021b) was successfully applied to derive the distributional regularization effect of distributional RL. We use $F^{s,a}$ to denote the distribution function of $Z^\pi(s, a)$ and consider the function class of $F^{s,a}$ that satisfies the following expectation decomposition:

$$F^{s,a}(x) = (1 - \epsilon) \mathbb{1}_{\{x \geq \mathbb{E}[Z^\pi(s,a)]\}}(x) + \epsilon F_\mu^{s,a}(x), \tag{6}$$

where the distribution function $F_\mu^{s,a}$ is determined by $F^{s,a}$ and $\epsilon$, and measures the impact of the remaining distribution independent of its expectation $\mathbb{E}[Z^\pi(s, a)]$. $\epsilon$ controls the proportion of $F_\mu^{s,a}(x)$, and the indicator function $\mathbb{1}_{\{x \geq \mathbb{E}[Z^\pi(s,a)]\}}$ equals 1 if $x \geq \mathbb{E}[Z^\pi(s, a)]$ and 0 otherwise. Although the function class of $F^{s,a}$ is restricted to satisfy this decomposition equality, it is still rich, with the rationale rigorously demonstrated in (Sun et al., 2021b). To reveal the speed-up effect of the distributional RL loss, we consider the density function form of Eq. 6, i.e.,

$$p^{s,a}(x) = (1 - \epsilon) \delta_{\{x = \mathbb{E}[Z^\pi(s,a)]\}}(x) + \epsilon \mu^{s,a}(x),$$

where $\delta_{\{x = \mathbb{E}[Z^\pi(s,a)]\}}$ is a Dirac function centered at $\mathbb{E}[Z^\pi(s, a)]$ that characterizes the expectation impact, and $\mu^{s,a}$ is the density function of $F_\mu^{s,a}$ that measures the additional value distribution information. Within Neural FZI, our goal is to minimize $\frac{1}{n} \sum_{i=1}^{n} L_\theta(s_i, a_i)$. We rewrite $L_\theta(s, a)$ as $L_\theta(g^{s,a}, f_\theta^{s,a})$, where the target density function $g^{s,a}$ can be $p^{s,a}$, $\mu^{s,a}$ or $\delta_{\{x = \mathbb{E}[Z^\pi(s,a)]\}}$, and $f^{s,a,\theta}$ is rewritten as $f_\theta^{s,a}$ for conciseness.
We denote $G_k(\theta) = \mathbb{E}\left[L_\theta(\delta_{\{x = \mathbb{E}[Z^\pi(s,a)]\}}, f_\theta^{s,a})\right]$ and write $G(\theta)$ for $G_k(\theta)$ for simplicity. Based on the categorical parameterization in Section 3.2, the convexity and smoothness properties with respect to the parameter $\theta$ in $f_\theta$ shown in Proposition 1 still hold for $G(\theta)$. As the KL divergence enjoys the property of unbiased gradient estimates, we let the variance of its stochastic gradient over the expectation $\delta_{\{x = \mathbb{E}[Z^\pi(s,a)]\}}$ be bounded, i.e.,

$$\mathbb{E}_{(s,a) \sim \rho^\pi}\left[\left\|\nabla L_\theta(\delta_{\{x = \mathbb{E}[Z^\pi(s,a)]\}}, f_\theta^{s,a}) - \nabla G(\theta)\right\|^2\right] = \sigma^2.$$

Next, following the similar label smoothing analysis in (Xu et al., 2020), we further characterize the degree to which $f_\theta^{s,a}$ approximates the target value distribution $\mu^{s,a}$ by measuring its variance as $\kappa\sigma^2$:

$$\mathbb{E}_{(s,a) \sim \rho^\pi}\left[\left\|\nabla L_\theta(\mu^{s,a}, f_\theta^{s,a}) - \nabla G(\theta)\right\|^2\right] = \hat{\sigma}^2 := \kappa\sigma^2.$$

Notably, $\kappa$ can be used to measure the approximation error between $f_\theta^{s,a}$ and $\mu^{s,a}$, and we do not assume $\hat{\sigma}^2$ to be bounded as $\kappa$ can be arbitrarily large. Expressing $\hat{\sigma}^2$ as $\kappa\sigma^2$ allows us to use $\kappa$ to characterize different acceleration effects for distributional RL. Concretely, a favorable approximation of $f_\theta^{s,a}$ to $\mu^{s,a}$ leads to a small $\kappa$, which contributes to the acceleration effect of distributional RL as shown in Theorem 2.

Proposition 2. Based on the value distribution decomposition in Eq. 6, and Eq. 8, we have:

$$\mathbb{E}_{(s,a) \sim \rho^\pi}\left[\left\|\nabla L_\theta(p^{s,a}, f_\theta^{s,a}) - \nabla G(\theta)\right\|^2\right] \leq (1 - \epsilon)^2 \sigma^2 + \epsilon^2 \kappa\sigma^2.$$

Proposition 2 follows immediately from Eq. 8; the proof is given in Appendix D. Before comparing the sample complexity in the optimization process of classical and distributional RL, we provide the definition of the first-order $\tau$-stationary point, which is preferred in deep learning optimization over a simple stationary point in order to guarantee generalization.

Definition 2. (First-order $\tau$-Stationary Point) While solving $\min_\theta G(\theta)$, the updated parameter $\theta_T$ after $T$ steps is a first-order $\tau$-stationary point if $\|\nabla G(\theta_T)\| \leq \tau$, where the small $\tau$ is in $(0, 1)$.

Based on Definition 2, we formally characterize the acceleration effects of distributional RL in Theorem 2, which depend on the approximation error between $\mu^{s,a}$ and $f_\theta^{s,a}$ measured by $\kappa$.

Theorem 2. (Sample Complexity and Acceleration Effects of Distributional RL) While running SGD to minimize $L_\theta$ in Eq. 6 within Neural FZI, we assume the step size $\lambda = 1/kl^2$ and $\epsilon = 1/(1 + \kappa)$ across (2) and (3), and that the sample is uniformly drawn from $T$ samples. Then:

(1) (Classical RL) When minimizing $L_\theta(\delta_{\{x = \mathbb{E}[Z^\pi(s,a)]\}}, f_\theta^{s,a})$, $T = O\left(\frac{1}{\tau^4}\right)$ such that $L_\theta$ converges to a $\tau$-stationary point in expectation.

(2) (Distributional RL with $\kappa \leq \frac{\tau}{2\sigma}$) When minimizing $L_\theta(p^{s,a}, f_\theta^{s,a})$, let $T = \frac{4G(\theta_0)}{\lambda\tau^2} = O\left(\frac{1}{\tau^2}\right)$; then $L_\theta$ converges to a $\tau$-stationary point in expectation.

(3) (Distributional RL with $\kappa > \frac{\tau}{2\sigma}$) When minimizing $L_\theta(p^{s,a}, f_\theta^{s,a})$, let $T = \frac{G(\theta_0)}{\lambda\kappa^2\sigma^2} = O\left(\frac{1}{\tau^2}\right)$; then $L_\theta$ does not converge to a $\tau$-stationary point, but an $O(\kappa^2)$-stationary point can be guaranteed.

The proof is provided in Appendix E. Theorem 2 is inspired by the intuitive connection between the value distribution in distributional RL and the label distribution in the label smoothing technique (Xu et al., 2020). Importantly, Theorem 2 demonstrates that solving the categorical distributional loss of distributional RL can speed up the convergence if the distribution approximation error is favorable. Otherwise, the convergence point, albeit stationary, may not guarantee a desirable performance under an agnostic $\kappa$, which may be very large in certain environments. In the classical RL scenario, we provide an equivalence between $L_\theta(\delta_{\{x = \mathbb{E}[Z^\pi(s,a)]\}}, f_\theta^{s,a})$ and the mean squared error loss (Eq. 1) in Neural FQI in Appendix H.
In the first scenario ((2) in Theorem 2), there is only a small approximation or parameterization error between $f_\theta^{s,a}$ and $p^{s,a}$ (or $\mu^{s,a}$), corresponding to a small $\kappa$ with $\kappa \leq \frac{\tau}{2\sigma}$. In this case, solving $L_\theta$ based on the categorical parameterization reduces the sample complexity from $O\left(\frac{1}{\tau^4}\right)$ to $O\left(\frac{1}{\tau^2}\right)$ compared with classical RL in (1) of Theorem 2, while still guaranteeing a $\tau$-stationary point. In the second scenario ((3) in Theorem 2), especially for some challenging environments with much intrinsic uncertainty, we may instead incur a relatively large approximation or parameterization error of $Z_\theta$ with a large $\kappa > \frac{\tau}{2\sigma}$, as the distributional TD approximation error could be large in practice. Under this circumstance, distributional RL algorithms may fail to speed up the convergence or to achieve superior performance compared with classical RL, as $O(\kappa^2)$ could be large in some complex environments. If $O(\kappa^2)$ is moderate, distributional RL can still perform reasonably due to the $O(\kappa^2)$-stationary point guarantee. These theoretical results also coincide with past empirical observations (Dabney et al., 2018b; Ceron & Castro, 2021), where distributional RL algorithms outperform classical RL in most cases, but are inferior in certain environments. Based on our results in Theorem 2, we contend that in these environments, which have much intrinsic uncertainty, the distribution parameterization error between $Z_\theta$ and the true value distribution under the distributional TD approximation is still too large ($\kappa > \frac{\tau}{2\sigma}$) to guarantee a favorable convergence point for distributional RL algorithms with different $d_p$, which is intuitive.
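The role of the decomposition in this section can be illustrated on a binned support. The sketch below is a hedged illustration, not the paper's derivation: it assumes the Dirac at the expected return is discretized to a one-hot vector on the bin containing the mean, uses made-up values for $\mu^{s,a}$, $f_\theta^{s,a}$ and $\kappa$, and all names are hypothetical. It shows that the cross-entropy loss is linear in its target, so the loss on $p^{s,a}$ splits into an expectation part and a distribution part weighted by $1 - \epsilon$ and $\epsilon$.

```python
import math


def bin_index(value, edges):
    """Index i of the bin (l_i, l_{i+1}] that contains the given value."""
    for i in range(len(edges) - 1):
        if edges[i] < value <= edges[i + 1]:
            return i
    raise ValueError("value lies outside the support")


def dirac_bins(mean_value, edges):
    """One-hot bin vector standing in for the Dirac at E[Z(s, a)]."""
    idx = bin_index(mean_value, edges)
    return [1.0 if i == idx else 0.0 for i in range(len(edges) - 1)]


def decomposed_target(mean_value, mu, edges, eps):
    """Binned p = (1 - eps) * Dirac at the mean + eps * mu."""
    delta = dirac_bins(mean_value, edges)
    return [(1 - eps) * d + eps * m for d, m in zip(delta, mu)]


def cross_entropy(target, f):
    """Proportional form of the categorical distributional loss."""
    return -sum(t * math.log(fi) for t, fi in zip(target, f))


if __name__ == "__main__":
    edges = [0.0, 1.0, 2.0, 3.0]   # bin edges l_0, ..., l_k with k = 3
    mu = [0.2, 0.3, 0.5]           # remaining distribution part
    f = [0.3, 0.4, 0.3]            # current prediction f_theta
    kappa = 3.0                    # hypothetical approximation-error level
    eps = 1.0 / (1.0 + kappa)      # choice of epsilon used in Theorem 2
    p = decomposed_target(1.5, mu, edges, eps)
    print("target p:", p)
    print("loss on p:", cross_entropy(p, f))
```

With $\epsilon = 1/(1 + \kappa)$, a larger $\kappa$ puts less weight on the distribution part $\mu^{s,a}$, which matches the intuition that a poorly approximated distribution contributes less useful signal.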

4. EXPERIMENTS

We perform extensive experiments on eight continuous control MuJoCo games to validate the theoretical optimization advantage of distributional RL algorithms analyzed in Section 3, including the stable gradient behaviors of distributional RL that lead to uniform stability, as well as the acceleration effects determined by the distribution parameterization error.

Implementation. Our implementation is based on Soft Actor Critic (SAC) (Haarnoja et al., 2018) and distributional Soft Actor Critic (Ma et al., 2020). We eliminate the optimization impact of entropy regularization in these algorithm implementations, and thus denote the resulting algorithms as Actor Critic (AC) and Distributional Actor Critic (DAC) for conciseness. For DAC, we first apply the C51 algorithm to the critic to extend the classical critic loss to the distributional version, denoted by DAC (C51), as our theoretical analysis in Sections 3.2 and 3.3 is mainly based on the categorical parameterization. We further extend our empirical demonstration heuristically to Quantile Regression distributional RL, i.e., Implicit Quantile Network (IQN), which is denoted as DAC (IQN). Hyper-parameters and more implementation details are provided in Appendix F.

[Figure 1: average return versus time steps for AC, DAC (C51) and DAC (IQN). Panels: ant, humanoid, walker2d, bipedalwalkerhardcore, halfcheetah, reacher, swimmer, humanoidstandup.]

4.1. PERFORMANCE AND UNIFORM STABILITY IN DISTRIBUTIONAL RL OPTIMIZATION

Figure 1 suggests that DAC (IQN) (orange lines) outperforms its classical counterpart AC (black lines) across all environments, while DAC (C51) (red lines) is inferior to AC on humanoid, walker2d and reacher. This could be explained by the more flexible parameterization of IQN compared with C51. We then demonstrate the advantage of uniform optimization stability for distributional RL over classical RL. According to Theorem 1, the stable optimization of the distribution loss with Neural FZI is described as a bounded loss difference for a neighboring dataset in terms of each state $s$ and action $a$. In other words, the error bound holds when taking the supremum over each state the agent encounters. To measure this algorithmic stability, we use as a proxy, albeit an imperfect one, the average gradient norm with respect to the state feature $x(s)$ over the whole optimization process, since the gradient measures the sensitivity of the loss function to each state the agent observes. Figure 2 shows that both DAC (C51) and DAC (IQN) entail a much smaller gradient norm magnitude than their classical counterpart AC (black lines) across all eight MuJoCo environments, which corroborates the theoretical advantage of uniform optimization stability for distributional RL analyzed in Theorem 1.

[Figure 2 panels: ant, humanoid, walker2d, bipedalwalkerhardcore, halfcheetah, reacher, swimmer, humanoidstandup. x-axis: Time Steps; y-axis: gradient norm with respect to the state feature x (log scale); curves: AC, DAC (C51), DAC (IQN).]

[Figure 3 panels: ant, humanoid, walker2d, bipedalwalkerhardcore, halfcheetah, reacher, swimmer, humanoidstandup. x-axis: Time Steps; y-axis: gradient norm (log scale); curves: AC, DAC (C51), DAC (IQN).]
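As a rough illustration of the stability proxy used in this subsection, the sketch below computes the average gradient norm of a categorical critic loss with respect to the state feature $x(s)$. It assumes a linear softmax head $f^\theta_i(x) = \mathrm{softmax}_i(\theta_i^\top x)$ rather than the neural critics used in the experiments, and the helper names `state_grad_norm` and `average_state_grad_norm` are hypothetical, not taken from the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def state_grad_norm(theta, x, p):
    """L2 norm of dL/dx for L = -sum_i p_i * log softmax_i(theta_i . x).

    theta: k rows of length d (assumed linear head),
    x: state feature of length d,
    p: target bin probabilities of length k (summing to 1).
    """
    logits = [sum(t * xd for t, xd in zip(row, x)) for row in theta]
    f = softmax(logits)
    # Chain rule: dL/dz_i = f_i - p_i and dz_i/dx = theta_i.
    grad = [
        sum((f[i] - p[i]) * theta[i][d] for i in range(len(theta)))
        for d in range(len(x))
    ]
    return math.sqrt(sum(g * g for g in grad))


def average_state_grad_norm(theta, states, targets):
    """Average the per-state gradient norms over a batch of visited states."""
    norms = [state_grad_norm(theta, x, p) for x, p in zip(states, targets)]
    return sum(norms) / len(norms)
```

In an actual deep RL implementation the same quantity would typically be obtained by automatic differentiation of the critic loss with respect to the input features; the closed form here only holds under the linear-head assumption.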

4.2. SMOOTHNESS PROPERTY AND ACCELERATION EFFECT OF DISTRIBUTIONAL RL

Theorem 2 shows that distributional RL can speed up convergence if the distribution parameterization is appropriate, as characterized by the variance of the gradient estimates with a small $\kappa$ (case (2) in Theorem 2). To demonstrate this, we use as a proxy the $\ell_2$-norms of the gradients with respect to the network parameters of the critic for AC and DAC. We mainly focus on a direct comparison between the vanilla AC and DAC algorithms, although their network architectures are slightly different. Similar results under the same architecture and via the value distribution decomposition of Eq. 6 are provided in Appendix G. Figure 3 shows that both DAC (C51) and DAC (IQN) have smaller gradient norms with respect to the network parameters $\theta$ than AC throughout the optimization process, which supports the claim that the distributional RL loss is more likely to enjoy the smoothness properties in Proposition 1. In terms of acceleration effects, the stationary-point properties in cases (2) and (3) of Theorem 2, albeit different, both guarantee bounded gradient norms, but precisely evaluating $\kappa$ in order to distinguish case (2) from case (3) for each algorithm in a specific environment is tricky. Nevertheless, given that DAC (IQN) outperforms DAC (C51) in most environments in Figure 1, we hypothesize that the inferiority of DAC (C51) on humanoid, walker2d and reacher could be owing to its larger parameterization error $\kappa$ in these environments. With a larger $\kappa$, DAC (C51) is more likely to fall under case (3) of Theorem 2, which would explain its worse performance relative to DAC (IQN), whose distributional expressive power is richer than that of C51.

5. DISCUSSIONS AND CONCLUSION

Our optimization analysis of distributional RL is based on the categorical parameterization, and an alternative analysis based on the Wasserstein distance would be a valuable complement to our conclusions. Acceleration effects could be further investigated to explain whether a given distributional RL algorithm can perform favorably in a specific environment. We leave these as future work. In this paper, we answer the question of how the value distribution in distributional RL helps optimization from two perspectives: the stable optimization analysis based on the smoothness property of the categorical distributional loss, and the acceleration effects determined by the variance of gradient estimates. We theoretically and empirically show that distributional RL enjoys stable gradient behaviors and can speed up convergence if the distribution approximation is desirable or the parameterization error is sufficiently small.

A DERIVATION OF CATEGORICAL DISTRIBUTIONAL LOSS

We show the derivation details of the categorical distributional loss, starting from the KL divergence between $p$ and $q_\theta$. Here $p_i$ is the cumulative probability increment of the target distribution $\{Y_i\}_{i\in[n]}$ within the $i$-th bin, and $q_\theta$ corresponds to a (normalized) histogram with density values $\frac{f_i^\theta(x(s))}{w_i}$ per bin. Thus, we have:

$$
\begin{aligned}
D_{KL}(p^{s,a}, q_\theta^{s,a}) &= \int_a^b p^{s,a}(y)\log p^{s,a}(y)\,dy - \int_a^b p^{s,a}(y)\log q_\theta^{s,a}(y)\,dy \\
&\propto -\int_a^b p^{s,a}(y)\log q_\theta^{s,a}(y)\,dy \\
&= -\sum_{i=1}^k \int_{l_i}^{l_i+w_i} p^{s,a}(y)\log\frac{f_i^\theta(x(s))}{w_i}\,dy \\
&= -\sum_{i=1}^k \log\frac{f_i^\theta(x(s))}{w_i}\,\underbrace{\left(F^{s,a}(l_i+w_i) - F^{s,a}(l_i)\right)}_{p_i^{s,a}} \\
&\propto -\sum_{i=1}^k p_i^{s,a}\log f_i^\theta(x(s)),
\end{aligned}
$$

where the first $\propto$ results from the fixed target $p^{s,a}$ in the Neural FZI framework. The second equality is based on the categorical parameterization of the density function $q_\theta^{s,a}$. The last $\propto$ holds because the width parameter $w_i$ can be ignored in this minimization problem.
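The last step of the derivation can be checked numerically. The sketch below assumes, purely for illustration, that the target density is also piecewise constant on the same bins, so that the KL divergence can be evaluated exactly; under that assumption the gap between the KL divergence and the categorical loss depends only on the target $p$, not on the model histogram $f$. All numerical values and function names are hypothetical.

```python
import math


def histogram_kl(p, f, w):
    """KL divergence between piecewise-constant densities p_i/w_i and f_i/w_i."""
    return sum(
        pi * (math.log(pi / wi) - math.log(fi / wi))
        for pi, fi, wi in zip(p, f, w)
        if pi > 0
    )


def cross_entropy(p, f):
    """Categorical loss -sum_i p_i log f_i (final line of the derivation)."""
    return -sum(pi * math.log(fi) for pi, fi in zip(p, f) if pi > 0)


p = [0.2, 0.5, 0.3]    # target bin probabilities (illustrative values)
w = [1.0, 0.5, 2.0]    # bin widths (illustrative values)
f_a = [0.3, 0.4, 0.3]  # first candidate model histogram
f_b = [0.1, 0.6, 0.3]  # second candidate model histogram

# The gap KL - CE depends only on p, so it is the same for both candidates.
gap_a = histogram_kl(p, f_a, w) - cross_entropy(p, f_a)
gap_b = histogram_kl(p, f_b, w) - cross_entropy(p, f_b)
print(abs(gap_a - gap_b) < 1e-12)
```

Because the gap is constant in $f$, minimizing the KL divergence over the model histogram and minimizing the categorical loss select the same minimizer in this setting.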

B PROOF OF PROPOSITION 1

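Proof. Consider the categorical distributional loss $L_\theta(s,a) = -\sum_{j=1}^k p_j^{s,a}\log f_j^\theta(x(s))$ derived in Appendix A.

(2) $L_\theta(s,a)$ is $kl$-Lipschitz continuous. We compute the gradient of the histogram distributional loss with respect to $\theta_i$:

$$
\begin{aligned}
\frac{\partial}{\partial\theta_i}\sum_{j=1}^k p_j^{s,a}\log f_j^\theta(x(s)) &= \sum_{j=1}^k p_j^{s,a}\frac{1}{f_j^\theta(x(s))}\nabla_{\theta_i} f_j^\theta(x(s)) \\
&= \sum_{j=1}^k p_j^{s,a}\frac{1}{f_j^\theta(x(s))} f_i^\theta(x(s))\left(\delta_{ij} - f_j^\theta(x(s))\right)x(s) \\
&= \Big(p_i^{s,a}\left(1 - f_i^\theta(x(s))\right) - \sum_{j\neq i} p_j^{s,a} f_i^\theta(x(s))\Big)x(s) \\
&= \left(p_i^{s,a} - p_i^{s,a} f_i^\theta(x(s)) - (1 - p_i^{s,a}) f_i^\theta(x(s))\right)x(s) \\
&= \left(p_i^{s,a} - f_i^\theta(x(s))\right)x(s),
\end{aligned}
$$

where $\delta_{ij} = 1$ if $i = j$, and 0 otherwise. Then, as we have $\|x(s)\| \le l$, we bound the norm of the gradient:

$$
\begin{aligned}
\Big\|\frac{\partial}{\partial\theta}\sum_{j=1}^k p_j\log f_j^\theta(x(s))\Big\| &\le \sum_{i=1}^k\Big\|\frac{\partial}{\partial\theta_i}\sum_{j=1}^k p_j\log f_j^\theta(x(s))\Big\| \\
&= \sum_{i=1}^k\left\|\left(p_i^{s,a} - f_i^\theta(x(s))\right)x(s)\right\| \\
&\le \sum_{i=1}^k\left|p_i^{s,a} - f_i^\theta(x(s))\right|\,\|x(s)\| \le kl.
\end{aligned}
$$

The last inequality holds because $|p_i - f_i^\theta(x(s))|$ is at most 1, and typically much smaller. Therefore, $L_\theta$ is $kl$-Lipschitz.

(3) $L_\theta$ is $kl^2$-Lipschitz smooth. A known lemma states that $\log(1 + \exp(x))$ is $\frac{1}{4}$-smooth, as its second-order derivative is bounded by $\frac{1}{4}$, and that if $g(w)$ is $\beta$-smooth with respect to $w$, then $g(\langle x, w\rangle)$ is $\beta\|x\|^2$-smooth. Based on this, we first focus on the one-dimensional case of the function $\log f_j^\theta(z)$, where $f_j^\theta(z) = \frac{\exp(z_j)}{\sum_{i=1}^k\exp(z_i)}$. As derived above, $\frac{\partial}{\partial\theta_i}\log f_j^\theta(z) = \delta_{ij} - f_i^\theta(z)$. The second-order derivative is then

$$
\frac{\partial^2\log f_j^\theta(z)}{\partial\theta_i\,\partial\theta_k} = -f_i^\theta(z)\left(\delta_{ik} - f_k^\theta(z)\right) =
\begin{cases}
f_i^\theta(z)\left(f_k^\theta(z) - 1\right) & \text{if } i = k, \\
f_i^\theta(z) f_k^\theta(z) & \text{otherwise.}
\end{cases}
$$

Clearly, $\left|\frac{\partial^2\log f_j^\theta(z)}{\partial\theta_i\,\partial\theta_k}\right| \le 1$, which implies that $\log f_j^\theta(z)$ is 1-smooth. Thus, $\log f_j^\theta(\langle x, \theta_i\rangle)$ is $\|x\|^2$-smooth, and hence $l^2$-smooth. Further, $\sum_{j=1}^k p_j^{s,a}\log f_j^\theta(x(s))$ is also $l^2$-smooth in each $\theta_i$, since

$$
\begin{aligned}
\Big\|\nabla_{\theta_i}\sum_{j=1}^k p_j^{s,a}\log f_j^\mu(x(s)) - \nabla_{\theta_i}\sum_{j=1}^k p_j^{s,a}\log f_j^\nu(x(s))\Big\| &\le \sum_{j=1}^k p_j^{s,a}\left\|\nabla_{\theta_i}\log f_j^\mu(x(s)) - \nabla_{\theta_i}\log f_j^\nu(x(s))\right\| \\
&\le \sum_{j=1}^k p_j^{s,a}\cdot l^2\|\mu - \nu\| = l^2\|\mu - \nu\|
\end{aligned}
$$

for any parameters $\mu$ and $\nu$. Therefore, we further have

$$
\begin{aligned}
\Big\|\nabla_\theta\sum_{j=1}^k p_j^{s,a}\log f_j^\mu(x(s)) - \nabla_\theta\sum_{j=1}^k p_j^{s,a}\log f_j^\nu(x(s))\Big\| &\le \sum_{i=1}^k\Big\|\nabla_{\theta_i}\sum_{j=1}^k p_j^{s,a}\log f_j^\mu(x(s)) - \nabla_{\theta_i}\sum_{j=1}^k p_j^{s,a}\log f_j^\nu(x(s))\Big\| \\
&\le \sum_{i=1}^k l^2\|\mu - \nu\| = kl^2\|\mu - \nu\|.
\end{aligned}
$$

Finally, we conclude that $L_\theta(s,a)$ is $kl^2$-smooth.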
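The closed-form gradient $(p_i - f_i^\theta(x(s)))\,x(s)$ and the bound $kl$ from part (2) can be spot-checked with finite differences. The sketch below assumes the linear softmax parameterization $f_i^\theta(x) = \mathrm{softmax}_i(\theta_i^\top x)$; the dimensions, the random seed and all function names are illustrative choices rather than anything prescribed by the proof.

```python
import math
import random


def logits_of(theta, x):
    """Inner products theta_i . x for every bin i."""
    return [sum(t * xd for t, xd in zip(row, x)) for row in theta]


def objective(theta, x, p):
    """sum_j p_j log f_j(x) with f = softmax(theta_j . x)."""
    z = logits_of(theta, x)
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return sum(pj * (zj - lse) for pj, zj in zip(p, z))


def analytic_grad(theta, x, p):
    """Closed form from the proof: the block for theta_i is (p_i - f_i) x."""
    z = logits_of(theta, x)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    f = [e / total for e in exps]
    return [[(p[i] - f[i]) * xd for xd in x] for i in range(len(theta))]


def numeric_grad(theta, x, p, h=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = []
    for i, row in enumerate(theta):
        g_row = []
        for d in range(len(row)):
            plus = [r[:] for r in theta]
            minus = [r[:] for r in theta]
            plus[i][d] += h
            minus[i][d] -= h
            diff = objective(plus, x, p) - objective(minus, x, p)
            g_row.append(diff / (2 * h))
        grad.append(g_row)
    return grad


def check(k=4, d=3, seed=0):
    """Return (max abs error, sum of per-block gradient norms, the bound k*l)."""
    rng = random.Random(seed)
    theta = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    x = [rng.uniform(-1, 1) for _ in range(d)]
    raw = [rng.random() + 1e-3 for _ in range(k)]
    p = [r / sum(raw) for r in raw]
    ga = analytic_grad(theta, x, p)
    gn = numeric_grad(theta, x, p)
    max_err = max(abs(a - b) for ra, rn in zip(ga, gn) for a, b in zip(ra, rn))
    l = math.sqrt(sum(xd * xd for xd in x))
    block_norm_sum = sum(math.sqrt(sum(g * g for g in row)) for row in ga)
    return max_err, block_norm_sum, k * l
```

Calling `check()` returns the finite-difference error together with both sides of the Lipschitz bound, so the inequality can be inspected for any choice of `k`, `d` and `seed`.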

C PROOF OF THEOREM 1

Proof. Consider the stochastic gradient descent update rule $G_{\lambda,L}(\theta) = \theta - \lambda\nabla_\theta L_\theta$. First, we provide two definitions about $L_\theta$ for the following proof.

Definition 3. ($\sigma$-bounded) An update rule is $\sigma$-bounded if $\sup_\theta\|\theta - G_{\lambda,L}(\theta)\| \le \sigma$.

Definition 4. ($\eta$-expansive) An update rule is $\eta$-expansive if $\sup_{v,w}\frac{\|G_{\lambda,L}(v) - G_{\lambda,L}(w)\|}{\|v - w\|} \le \eta$.

Lemma 1. (Growth Recursion, Lemma 2.5 of (Hardt et al., 2016)) Fix an arbitrary sequence of updates $G_1, \ldots, G_T$ and another sequence $G'_1, \ldots, G'_T$. Let $\theta_0 = \theta'_0$ be the starting point and define $\delta_t = \|\theta'_t - \theta_t\|$, where $\theta_t$ and $\theta'_t$ are defined recursively through $\theta_{t+1} = G_{\lambda,L}(\theta_t)$ and $\theta'_{t+1} = G'_{\lambda,L}(\theta'_t)$. Then we have the recurrence relation:

$$
\delta_{t+1} \le
\begin{cases}
\eta\,\delta_t & \text{if } G_t = G'_t \text{ is } \eta\text{-expansive}, \\
\min(\eta, 1)\,\delta_t + 2\sigma_t & \text{if } G_t \text{ and } G'_t \text{ are } \sigma\text{-bounded and } G_t \text{ is } \eta\text{-expansive}.
\end{cases}
$$

Lemma 2. (Lipschitz Continuity) Assume $L_\theta$ is $L$-Lipschitz. Then the gradient update $G_{\lambda,L}$ is $(\lambda L)$-bounded.

Proof. $\|\theta - G_{\lambda,L}(\theta)\| = \|\lambda\nabla_\theta L_\theta\| \le \lambda L$.

Lemma 3. (Lipschitz Smoothness) Assume $L_\theta$ is $\beta$-smooth. Then for any $\lambda \le \frac{2}{\beta}$, the gradient update $G_{\lambda,L}$ is 1-expansive.

Proof. Please refer to Lemma 3.7 in (Hardt et al., 2016) for the proof.

Based on the results above, we now prove Theorem 1. Our proof is largely based on (Hardt et al., 2016), adapted to the distributional RL setting by exploiting the desirable properties of the histogram distributional loss. According to Proposition 1, $L_\theta$ is $kl$-Lipschitz as well as $kl^2$-smooth, and thus, by Lemma 2 and Lemma 3, $G_{\lambda,L}$ is $(\lambda kl)$-bounded, and 1-expansive if $\lambda \le \frac{2}{kl^2}$. At step $t$, SGD selects a sample that is in both $D$ and $D'$ with probability $1 - \frac{1}{n}$. In this case, $G_t = G'_t$, and thus $\delta_{t+1} \le \delta_t$, as $G_t$ is 1-expansive by Lemma 1. In the other case, the selected samples differ, which happens with probability $\frac{1}{n}$, and $\delta_{t+1} \le \delta_t + 2\lambda_t kl$ by Lemma 1.
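Thus, if $\lambda_t \le \frac{2}{kl^2}$, for each state $s$ and action $a$, we have:

$$
\begin{aligned}
\mathbb{E}\left|L_{\theta_T}(s,a) - L_{\theta'_T}(s,a)\right| &\le kl\,\mathbb{E}[\delta_T], \quad \text{where } \delta_T = \|\theta_T - \theta'_T\| \\
&\le kl\left(\left(1 - \tfrac{1}{n}\right)\mathbb{E}[\delta_{T-1}] + \tfrac{1}{n}\mathbb{E}[\delta_{T-1}] + \tfrac{2\lambda_{T-1}kl}{n}\right) \\
&= kl\left(\mathbb{E}[\delta_{T-1}] + \tfrac{2\lambda_{T-1}kl}{n}\right) \\
&= kl\left(\mathbb{E}[\delta_0] + \sum_{t=0}^{T-1}\tfrac{2\lambda_t kl}{n}\right) \\
&\le \frac{2k^2l^2}{n}\sum_{t=0}^{T-1}\frac{2}{kl^2} = \frac{4kT}{n}.
\end{aligned}
$$

Since this bound holds for all $D$, $D'$ and $(s, a)$, we attain the uniform stability in Definition 1 for our categorical distributional loss applied in distributional RL.

Define the population risk as $R[\theta] = \mathbb{E}_x L_\theta(s,a)$ […]

[…] $\|\nabla L_\theta(\delta_{\{x=\mathbb{E}[Z^\pi(s,a)]\}}, f_\theta^{s,a})\|^2$, where the first relation follows from the definition of Lipschitz smoothness and the second from the update rule of $\theta$. Next, we take the expectation on both sides:

$$
\begin{aligned}
\mathbb{E}[G(\theta_{t+1}) - G(\theta_t)] &\le -\lambda\,\mathbb{E}\|\nabla G(\theta_t)\|^2 + \frac{kl^2\lambda^2}{2}\,\mathbb{E}\left\|\nabla L_\theta(\delta_{\{x=\mathbb{E}[Z^\pi(s,a)]\}}, f_\theta^{s,a}) - \nabla G(\theta_t) + \nabla G(\theta_t)\right\|^2 \\
&\le -\lambda\,\mathbb{E}\|\nabla G(\theta_t)\|^2 + \frac{kl^2\lambda^2}{2}\,\mathbb{E}\left\|\nabla L_\theta(\delta_{\{x=\mathbb{E}[Z^\pi(s,a)]\}}, f_\theta^{s,a}) - \nabla G(\theta_t)\right\|^2 + \frac{kl^2\lambda^2}{2}\,\mathbb{E}\|\nabla G(\theta_t)\|^2 \\
&= \frac{\lambda(kl^2\lambda - 2)}{2}\,\mathbb{E}\|\nabla G(\theta_t)\|^2 + \frac{kl^2\lambda^2}{2}\sigma^2 \\
&\le -\frac{\lambda}{2}\,\mathbb{E}\|\nabla G(\theta_t)\|^2 + \frac{kl^2\lambda^2}{2}\sigma^2,
\end{aligned}
$$

where the first two relations hold because $\nabla G(\theta) = \mathbb{E}[\nabla L_\theta]$, and the last inequality comes from $\lambda \le \frac{1}{kl^2}$. Summing over $t$, we obtain

$$
\mathbb{E}[G(\theta_T) - G(\theta_0)] \le -\frac{\lambda}{2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla G(\theta_t)\|^2 + \frac{kl^2\lambda^2 T}{2}\sigma^2.
$$

Letting $\mathbb{E}[G(\theta_T)] = 0$, we have

$$
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla G(\theta_t)\|^2 \le \frac{2G(\theta_0)}{\lambda T} + kl^2\lambda\sigma^2.
$$

By setting $\lambda \le \frac{\tau^2}{2kl^2\sigma^2}$ and $T = \frac{4G(\theta_0)}{\lambda\tau^2}$, we have

$$
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla G(\theta_t)\|^2 \le \tau^2.
$$

(2) and (3). We again rely on the $kl^2$-smoothness of $L(p^{s,a}, f_\theta^{s,a})$: $G(\theta_{t+1}) - G(\theta_t)$ […] $\|a - b\|^2 - \|a\|^2 - \|b\|^2$, and the last inequality follows from $\lambda \le \frac{1}{kl^2}$. After taking the expectation, we have

$$
\begin{aligned}
\mathbb{E}[G(\theta_{t+1}) - G(\theta_t)] &\le -\frac{\lambda}{2}\,\mathbb{E}\|\nabla G(\theta_t)\|^2 + \frac{\lambda}{2}\,\mathbb{E}\left\|\nabla G(\theta_t) - \nabla L_\theta(p^{s,a}, f_\theta^{s,a})\right\|^2 \\
&\le -\frac{\lambda}{2}\,\mathbb{E}\|\nabla G(\theta_t)\|^2 + \frac{\lambda}{2}\left((1-\epsilon)^2\sigma^2 + \epsilon^2\kappa^2\sigma^2\right),
\end{aligned}
$$

where the last inequality is based on Proposition 2. Summing over $t$, we obtain

$$
\mathbb{E}[G(\theta_T) - G(\theta_0)] \le -\frac{\lambda}{2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla G(\theta_t)\|^2 + \frac{T\lambda}{2}\left((1-\epsilon)^2\sigma^2 + \epsilon^2\kappa^2\sigma^2\right).
$$

Letting $\mathbb{E}[G(\theta_T)] = 0$ and $\epsilon = \frac{1}{1+\kappa}$, we have

$$
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla G(\theta_t)\|^2 &\le \frac{2G(\theta_0)}{\lambda T} + (1-\epsilon)^2\sigma^2 + \epsilon^2\kappa^2\sigma^2 \\
&= \frac{2G(\theta_0)}{\lambda T} + \frac{2\kappa^2}{(1+\kappa)^2}\sigma^2 \\
&\le \frac{2G(\theta_0)}{\lambda T} + 2\kappa^2\sigma^2. \qquad (22)
\end{aligned}
$$

If $\kappa \le \frac{\tau}{2\sigma}$ and $T = \frac{4G(\theta_0)}{\lambda\tau^2}$, this leads to $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla G(\theta_t)\|^2 \le \tau^2$, i.e., a $\tau$-stationary point, with sample complexity $O(\frac{1}{\tau^2})$. Thus, (2) has been proved. On the other hand, if $\kappa > \frac{\tau}{2\sigma}$, we set $T = \frac{G(\theta_0)}{\lambda\kappa^2\sigma^2}$. This implies that $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla G(\theta_t)\|^2 \le 4\kappa^2\sigma^2 = O(\kappa^2)$. Therefore, the degree of the stationary point is determined by the degree of distribution approximation measured by $\kappa$. Thus, we obtain (3).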
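To make the three regimes concrete, the sketch below evaluates the step sizes and iteration counts used in the proof for given constants. It is only an illustration: the function names are hypothetical, the distributional case picks the largest admissible step size $\lambda = \frac{1}{kl^2}$ (the proof only requires $\lambda \le \frac{1}{kl^2}$), and the constants $G(\theta_0)$, $k$, $l$, $\sigma$, $\tau$ and $\kappa$ would have to be estimated in practice.

```python
def classical_schedule(g0, k, l, sigma, tau):
    """Case (1): step size and iteration count for the expectation-based loss.

    Uses lambda <= min(1/(k l^2), tau^2 / (2 k l^2 sigma^2)) and
    T = 4 G(theta_0) / (lambda tau^2), which gives T = O(1 / tau^4).
    """
    lam = min(1.0 / (k * l**2), tau**2 / (2 * k * l**2 * sigma**2))
    steps = 4 * g0 / (lam * tau**2)
    return lam, steps


def distributional_schedule(g0, k, l, sigma, tau, kappa):
    """Cases (2) and (3): returns (lambda, T, bound on the averaged
    squared gradient norm)."""
    lam = 1.0 / (k * l**2)  # assumed choice: largest admissible step size
    if kappa <= tau / (2 * sigma):
        # Case (2): tau-stationary point with T = O(1 / tau^2).
        return lam, 4 * g0 / (lam * tau**2), tau**2
    # Case (3): only an O(kappa^2)-stationary point is guaranteed.
    return lam, g0 / (lam * kappa**2 * sigma**2), 4 * kappa**2 * sigma**2
```

For example, with $G(\theta_0) = 1$, $k = 2$, $l = 1$, $\sigma = 1$ and $\tau = 0.1$, the classical schedule requires about 160,000 iterations, while the distributional schedule with $\kappa = 0.01$ requires about 800. Halving $\tau$ multiplies the former by 16 and the latter by 4, matching the $O(\frac{1}{\tau^4})$ and $O(\frac{1}{\tau^2})$ rates.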

F IMPLEMENTATION DETAILS

Our implementation is directly adapted from the source code in (Ma et al., 2020). For DAC (IQN), we consider quantile regression for the distribution estimation in the critic loss. Instead of using fixed quantiles as in QR-DQN (Dabney et al., 2018b), we leverage the quantile fraction generation

suggests DAC (C51) still enjoys smaller gradient norms compared with AC in this fair comparison setting.

Results under Value Distribution Decomposition. We also provide the gradient norms of both the expectation and the distribution parts based on the value distribution decomposition in Eq. 6. Similar results can still be observed in Figure 5.

For uniform notation, we ignore $k$ in Neural FZI. If $\{Z_\theta : \theta \in \Theta\}$ is sufficiently large such that it contains $T^{\text{opt}} Z_{\theta^*}$, then optimizing Neural FZI in Eq. 2 leads to $Z_\theta = T^{\text{opt}} Z_{\theta^*}$. Similarly, optimizing Neural FQI ideally yields $Q_\theta = T^{\text{opt}} Q_{\theta^*}$. We denote by $w_E$ the interval that $\mathbb{E}[Z^{\text{target}}(s,a)]$ falls into, and $Z^{\text{target}} = T^{\text{opt}} Z_{\theta^*}$ in expectation, as shown in Neural FZI. We have $f^{s,a} \to q^{s,a}$ as $w_{\max} = \max_i w_i \to 0$, where $q^{s,a}$ is the continuous target probability density function. Then we have:

$$
L_\theta(\delta_{\{x=\mathbb{E}[Z^\pi(s,a)]\}}, f_\theta^{s,a}) = -\int_{x\in w_E}\delta_{\{x=\mathbb{E}[Z^{\text{target}}(s,a)]\}}\log f_\theta^{s,a}(w_E)\,dx \to -\log q_\theta^{s,a}\left(\mathbb{E}[Z^{\text{target}}(s,a)]\right) \quad \text{as } w_E \to 0, \qquad (25)
$$

where we know $\int\delta(x)\,dx = 1$. Since $\{Z_\theta : \theta \in \Theta\}$ is sufficiently large, the KL minimizer would be $q^{s,a}$

This indicates that the optimal value distribution $Z_\theta$ may take values other than the expectation, but the probability that these events happen is 0. This implies that minimizing the KL divergence with respect to the Dirac delta function is "almost" equivalent to minimizing the mean squared loss.



Figure 1: Performance. Learning curves of AC, DAC (C51) and DAC (IQN) over 5 seeds with a smoothing window of size 5 across eight MuJoCo games.

Figure 2: Stable Optimization. The critic gradient norms in the logarithmic scale regarding the state during the training of AC, DAC (C51), DAC (IQN) over 5 seeds on eight MuJoCo environments.

Figure 3: Acceleration Effect. The critic gradient norms in the logarithmic scale regarding network parameters in the training of AC, DAC (C51), DAC (IQN) over 5 seeds on MuJoCo environments.

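$L_\theta(s,a) = -\sum_{i=1}^k p_i^{s,a}\log\frac{\exp(x(s)^\top\theta_i)}{\sum_{j=1}^k\exp(x(s)^\top\theta_j)}$

(1) Convexity. Note that $-\log\frac{\exp(x(s)^\top\theta_i)}{\sum_{j=1}^k\exp(x(s)^\top\theta_j)} = \log\sum_{j=1}^k\exp\left(x(s)^\top\theta_j\right) - x(s)^\top\theta_i$. The first term is a log-sum-exp, which is convex (see Convex Optimization by Boyd and Vandenberghe), and the second term is an affine function. Thus, $L_\theta(s,a)$ is convex.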
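The convexity claim in (1) can be spot-checked numerically by testing the defining inequality $L(t\theta_a + (1-t)\theta_b) \le tL(\theta_a) + (1-t)L(\theta_b)$ on random parameter pairs. This is a sanity check under the linear softmax parameterization, not a proof; the sampling ranges, the seed and the function names are arbitrary illustrative choices.

```python
import math
import random


def categorical_loss(theta, x, p):
    """L = sum_i p_i * (logsumexp_j(theta_j . x) - theta_i . x)."""
    z = [sum(t * xd for t, xd in zip(row, x)) for row in theta]
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return sum(pi * (lse - zi) for pi, zi in zip(p, z))


def max_convexity_violation(trials=200, k=4, d=3, seed=0):
    """Largest value of L(mix) - (t L(a) + (1 - t) L(b)) over random draws.

    For a convex loss this is never positive, up to floating-point rounding.
    """
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(d)]
        raw = [rng.random() + 1e-3 for _ in range(k)]
        p = [r / sum(raw) for r in raw]
        a = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
        b = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
        t = rng.random()
        mix = [
            [t * av + (1 - t) * bv for av, bv in zip(ra, rb)]
            for ra, rb in zip(a, b)
        ]
        chord = t * categorical_loss(a, x, p) + (1 - t) * categorical_loss(b, x, p)
        worst = max(worst, categorical_loss(mix, x, p) - chord)
    return worst
```

A return value that is non-positive (up to rounding error) is consistent with convexity; a clearly positive value would indicate a bug in the loss implementation.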

implying that the degenerate loss function based on the expectation $\delta_{\{x=\mathbb{E}[Z^\pi(s,a)]\}}$ can achieve a $\tau$-stationary point with sample complexity $T = O(\frac{1}{\tau^4})$.

Figure 4: The critic gradient norms in the logarithmic scale during the training of AC and DAC (C51) over 5 seeds on three MuJoCo games. We keep the same DAC network architecture and evaluate based on the expectation of the represented value distribution.

Figure 5: The critic gradient norms in the logarithmic scale during the training of AC and DAC (C51) over 5 seeds on three MuJoCo games. The AC results are the expectation part calculated via the value distribution decomposition.

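$= \delta_{\mathbb{E}[Z^{\text{target}}(s,a)]}$, where $\delta_{\mathbb{E}[Z^{\text{target}}(s,a)]}$ is a Dirac delta function centered at $\mathbb{E}[Z^{\text{target}}(s,a)]$ and can be viewed as a generalized probability density function. According to the definition of the Dirac delta function, as $w_{\max} \to 0$, we attain $P\left(Z_\theta(s,a) = \mathbb{E}[Z^{\text{target}}(s,a)]\right) = 1$, where $\mathbb{E}[Z^{\text{target}}(s,a)] = \mathbb{E}[T^{\text{opt}} Z_{\theta^*}(s,a)]$. By the linearity of expectation analyzed in Lemma 4 of (Bellemare et al., 2017a), we have

$$
\mathbb{E}\left[T^{\text{opt}} Z_{\theta^*}(s,a)\right] = T^{\text{opt}}\,\mathbb{E}\left[Z_{\theta^*}(s,a)\right] = T^{\text{opt}} Q_{\theta^*}(s,a). \qquad (27)
$$

Finally, for each iteration in Neural FZI, the following equation always holds:

$$
P\left(Z_\theta(s,a) = T^{\text{opt}} Q_{\theta^*}(s,a)\right) = 1 \quad \text{as } \max_i w_i \to 0. \qquad (28)
$$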

The choice of $d_p$. In fact, $d_p$ has a two-fold impact on the optimization of the whole Neural FZI. First, $d_p$ determines the convergence rate of the distributional Bellman update. For instance, the distributional Bellman operator under the Cramér distance is $\gamma$


Ethics Statement. Since our study concerns the theoretical properties of distributional RL algorithms, we do not believe that our research raises any ethical issues.

Reproducibility Statement. As stated in Section 4, our implementation is based on the public code of SAC (Haarnoja et al., 2018) and Distributional SAC (Ma et al., 2020). We also provide implementation details in Appendix F for reproducibility. For the theoretical results, rigorous proofs are given in Appendices A to E.

based on IQN (Dabney et al., 2018a) that uniformly samples quantile fractions in order to approximate the full quantile function. In particular, we fix the number of quantile fractions as $N$ and keep them in ascending order. Besides, we adapt the sampling as $\tau_0 = 0$,

F.1 HYPER-PARAMETERS AND NETWORK STRUCTURE

We adopt the same hyper-parameters, which are listed in Table 1, and the same network structure as in the original distributional SAC paper (Ma et al., 2020). As suggested in Table 1, after a line search for hyper-parameter tuning, we select $l_k$ as 500, 10,000, 15,000, 160, 50, 5,000, 500, 500 for ant, halfcheetah, humanoidstand, swimmer, bipedalwalkerhardcore, humanoid, walker2d and reacher, respectively.

G EXPERIMENTAL RESULTS ON ACCELERATION EFFECTS OF DISTRIBUTIONAL RL

Same Architecture. For a fair comparison, we keep the same DAC network architecture and evaluate the gradient norms of DAC (C51) and a variant of AC, which is optimized based on the expectation of the represented value distribution within the DAC implementation framework. Figure 4

