DISTRIBUTIONAL REINFORCEMENT LEARNING VIA SINKHORN ITERATIONS

Abstract

Distributional reinforcement learning (RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than only its expectation. The empirical success of distributional RL is largely determined by the representation of return distributions and the choice of distribution divergence. In this paper, we propose a new class of Sinkhorn distributional RL (SinkhornDRL) algorithms that learn a finite set of statistics, i.e., deterministic samples, from each return distribution and then use Sinkhorn iterations to evaluate the Sinkhorn divergence between the current and target Bellman distributions. Sinkhorn divergence interpolates between the Wasserstein distance and Maximum Mean Discrepancy (MMD). SinkhornDRL thus finds a sweet spot, taking advantage of the geometry of optimal-transport-based distances and the unbiased gradient estimates of MMD. Finally, compared with state-of-the-art algorithms, SinkhornDRL's competitive performance is demonstrated on the suite of 55 Atari games.

Under review as a conference paper at ICLR 2023

The dominant quantile regression-based algorithms suffer from the non-crossing issue in quantile estimation (Zhou et al., 2020), while a sample-based Sinkhorn distributional algorithm naturally circumvents this problem. In this paper, we propose a novel distributional RL family based on Sinkhorn divergence. First, we show the key roles of the distribution divergence and the value distribution representation in the design of distributional RL algorithms. After a detailed introduction of our proposed SinkhornDRL algorithm, we theoretically analyze, with a non-trivial proof, the convergence of the distributional Bellman operator under Sinkhorn divergence. An equivalence between Sinkhorn divergence and a regularized MMD is also established, interpreting its empirical success in real applications. Finally, we compare the performance of SinkhornDRL with typical baselines on 55 Atari games, verifying the competitive performance of our proposal.
Our method inspires researchers, when designing new distributional RL algorithms in the future, to seek a trade-off that simultaneously leverages the geometry of the Wasserstein distance and the favorable unbiased gradient estimates of MMD.

2. PRELIMINARY KNOWLEDGE

In classical RL, an agent interacts with an environment via a Markov decision process (MDP), a 5-tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, $P$ is the environment transition dynamics, $R$ is the reward function and $\gamma \in (0, 1)$ is the discount factor.

From Value Function to Value Distribution. Given a policy $\pi$, the discounted sum of future rewards is a random variable $Z^\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$, where $s_0 = s$, $a_0 = a$, $s_{t+1} \sim P(\cdot|s_t, a_t)$, and $a_t \sim \pi(\cdot|s_t)$. In the control setting, expectation-based RL is built on the action-value function $Q^\pi(s, a)$, the expectation of $Z^\pi(s, a)$, i.e., $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]$. By contrast, distributional RL focuses on the action-value distribution, the full distribution of $Z^\pi(s, a)$. The incorporation of this additional distributional knowledge intuitively explains its empirical success.
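The distinction between $Q^\pi$ and the full return distribution $Z^\pi$ can be illustrated with Monte Carlo rollouts. Below is a minimal Python sketch on a hypothetical two-state chain MDP; all names, rewards and dynamics are our own illustration, not from the paper:

```python
import random

def sample_return(gamma=0.9, n_steps=200):
    """One Monte Carlo sample of the random return Z^pi(s0, a0): a truncated
    discounted reward sum along a trajectory of a toy two-state chain."""
    g, discount, s = 0.0, 1.0, 0
    for _ in range(n_steps):
        g += discount * (1.0 if s == 0 else 0.0)   # reward 1 only in state 0
        discount *= gamma
        s = 0 if random.random() < 0.5 else 1      # symmetric random transition
    return g

random.seed(0)
returns = [sample_return() for _ in range(5000)]   # samples of Z^pi
q_value = sum(returns) / len(returns)              # Q^pi = E[Z^pi], about 5.5 here
spread = max(returns) - min(returns)               # variability that Q^pi discards
```

Expectation-based RL keeps only `q_value`; distributional RL models the whole empirical distribution of `returns`, e.g., via the particle representation used later in the paper.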

1. INTRODUCTION

Classical reinforcement learning (RL) algorithms are normally based on the expectation of the discounted cumulative rewards that an agent observes while interacting with the environment. Recently, a new class of RL algorithms called distributional RL estimates the full distribution of total returns and has exhibited state-of-the-art performance in a wide range of environments (Bellemare et al., 2017a; Dabney et al., 2018b;a; Yang et al., 2019; Zhou et al., 2020; Nguyen et al., 2020). In the distributional RL literature, algorithms based on either the Wasserstein distance or MMD have gained great attention due to their superior performance, and their mutual connection in terms of mathematical properties motivates us to explore further in order to design new algorithms. In particular, the Wasserstein distance, long known as a powerful tool to compare probability distributions with non-overlapping supports, has recently emerged as an appealing contender in various machine learning applications. The Wasserstein distance was long disregarded because, in its original form, its computation requires solving an expensive network flow problem. However, recent works (Sinkhorn, 1967; Genevay et al., 2018) have shown that this cost can be largely mitigated by settling for cheaper approximations through strongly convex regularizers. The benefit of this regularization has opened the path to wider applications of the Wasserstein distance in relevant learning problems, including the design of distributional RL algorithms. The Sinkhorn divergence (Sinkhorn, 1967) introduces an entropic regularization of the Wasserstein distance, making its evaluation tractable, especially in high dimensions. It has been successfully applied in numerous crucial machine learning developments, including Sinkhorn-GAN (Genevay et al., 2018) and Sinkhorn-based adversarial training (Wong et al., 2019).
More importantly, it has been shown that Sinkhorn divergence interpolates between the Wasserstein distance and MMD, and equivalences can be established in the limit cases (Feydy et al., 2019; Ramdas et al., 2017; Nguyen et al., 2020). However, a Sinkhorn-based distributional RL algorithm has not yet been formally proposed, and its connection with algorithms based on the Wasserstein distance and MMD is also less studied. Therefore, a natural question is: can we design a new class of distributional RL algorithms via Sinkhorn divergence, thus bridging the gap between the two existing main branches of distributional RL algorithms? Moreover, the dominant quantile regression-based algorithms, e.g., QR-DQN (Dabney et al., 2018b), suffer from the non-crossing issue in quantile estimation, which a sample-based algorithm circumvents naturally.

Distributional Bellman Operator. The value distribution satisfies the distributional Bellman equation $T^\pi Z(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(s', a')$, where $s' \sim P(\cdot|s, a)$ and $a' \sim \pi(\cdot|s')$. The equality implies that the random variables on both sides are equal in distribution. The distributional Bellman operator $T^\pi$ is contractive under certain distribution divergence metrics. We provide a detailed discussion of more related works in Appendix A.

2.2. DIVERGENCES BETWEEN MEASURES

Optimal Transport (OT) and Wasserstein Distance. The optimal transport (OT) metric between two probability measures $(\mu, \nu)$ is defined as the solution of the linear program $\min_{\Pi \in \Pi(\mu,\nu)} \int c(x, y)\, d\Pi(x, y)$, where $c$ is the cost function and $\Pi(\mu,\nu)$ is the set of joint distributions with marginals $(\mu, \nu)$. The Wasserstein distance (a.k.a. earth mover's distance) is the special case of optimal transport with a norm-based cost function. In particular, given two scalar random variables $X$ and $Y$, the $p$-Wasserstein metric $W_p$ between the distributions of $X$ and $Y$ simplifies to $W_p(X, Y) = \big(\int_0^1 |F_X^{-1}(\omega) - F_Y^{-1}(\omega)|^p\, d\omega\big)^{1/p}$, where $F^{-1}$ is the inverse cumulative distribution function of a random variable. The desirable geometric properties of the Wasserstein distance allow it to respect the full support of the measures, but it suffers from the curse of dimensionality (Genevay et al., 2019; Arjovsky et al., 2017).

Maximum Mean Discrepancy. The squared Maximum Mean Discrepancy (MMD) with kernel $k$ is formulated as $\mathrm{MMD}^2_k(X, Y) = \mathbb{E}[k(X, X')] + \mathbb{E}[k(Y, Y')] - 2\mathbb{E}[k(X, Y)]$, where $k(\cdot,\cdot)$ is a continuous kernel on $\mathcal{X}$ and $X'$ (resp. $Y'$) is an independent copy of $X$ (resp. $Y$). If $k$ is a trivial kernel, MMD degenerates to the energy distance. Mathematically, the "flat" geometry that MMD induces on the space of probability measures does not faithfully lift the ground distance (Feydy et al., 2019), but MMD is cheaper to compute than OT and has a smaller sample complexity, i.e., the error of approximating the distance with samples from the measures (Genevay et al., 2019). We provide a detailed introduction of more distribution divergences in Appendix B.

In the Neural Fitted Z-Iteration framework, the target $Y_i = R(s_i, a_i) + \gamma Z^k_{\theta^*}(s'_i, \pi_Z(s'_i))$ with $\pi_Z(s') = \arg\max_{a'} \mathbb{E}\big[Z^k_{\theta^*}(s', a')\big]$ is fixed within every $T_{\mathrm{target}}$ steps to update the target network $Z_{\theta^*}$, and $d_p$ is the distribution divergence.
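Both divergences are easy to estimate for one-dimensional empirical measures, the setting relevant to scalar returns. A small numpy sketch (function names and toy data are our own illustration): in 1-D the $p$-Wasserstein distance reduces to matching sorted samples (the inverse-CDF formula above), while MMD with the unrectified kernel $k_\alpha(a,b) = -|a-b|^\alpha$ is a plain double sum over pairs.

```python
import numpy as np

def wasserstein_p(x, y, p=1):
    """p-Wasserstein between two equal-size 1-D empirical distributions.
    In 1-D the optimal coupling matches sorted samples (inverse-CDF formula)."""
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

def mmd2_energy(x, y, alpha=1.0):
    """Biased (V-statistic) estimate of squared MMD with the unrectified kernel
    k_alpha(a, b) = -|a - b|^alpha (alpha = 1 corresponds to the energy distance)."""
    def k(a, b):
        return -np.abs(a[:, None] - b[None, :]) ** alpha
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

x = np.array([0.0, 1.0, 2.0, 3.0])
y = x + 1.0                       # same shape, shifted by one unit
w1 = wasserstein_p(x, y, p=1)     # exactly 1.0 for a unit shift
mmd2 = mmd2_energy(x, y)          # strictly positive, but not the shift size
```

The contrast is the point made in the text: the Wasserstein distance reports the ground-metric displacement exactly, whereas MMD's "flat" geometry does not.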

3.2. KEY ROLES OF $d_p$ AND $Z_\theta$

Within the Neural Fitted Z-Iteration framework proposed in Eq. 6, we observe that the choice of representation for $Z_\theta$ and the metric $d_p$ are pivotal for distributional RL algorithms. For instance, QR-DQN (Dabney et al., 2018b) approximates the Wasserstein distance $W_p$ and leverages quantiles to represent the distribution of $Z_\theta$. C51 (Bellemare et al., 2017a) represents $Z_\theta$ via a categorical distribution and converges under the Cramér distance (Bellemare et al., 2017b; Rowland et al., 2018), while MMD distributional RL (MMDDRL) (Nguyen et al., 2020) learns samples to represent the distribution of $Z_\theta$ based on MMD. We compare characteristics of these distribution divergences, including the convergence rate and sample complexity, in Table 1. Theoretical results regarding Sinkhorn divergence are based on (Genevay et al., 2019), and detailed convergence proofs for the other distances are provided in Appendix B. In summary, we argue that $d_p$ and $Z_\theta$ are two crucial factors in distributional RL design, based on which we introduce Sinkhorn distributional RL.

| Algorithm | Divergence $d_p$ | Representation $Z_\theta$ | Convergence rate of $T^\pi$ | Sample complexity of $d_p$ |
|---|---|---|---|---|
| C51 | Cramér distance | Histogram | $\sqrt{\gamma}$ | -- |
| QR-DQN | Wasserstein distance | Quantiles | $\gamma$ | $O(n^{-1/d})$ |
| MMDDRL | MMD | Samples | $\gamma^{\alpha/2}$ with kernel $k_\alpha$ | $O(1/\sqrt{n})$ |
| SinkhornDRL (ours) | Sinkhorn divergence | Samples | $\gamma$ ($\varepsilon \to 0$), $\gamma^{\alpha/2}$ ($\varepsilon \to \infty$) | $O\big(e^{\kappa/\varepsilon} \varepsilon^{-d/2} / \sqrt{n}\big)$ ($\varepsilon \to 0$), $O(n^{-1/2})$ ($\varepsilon \to \infty$) |

Table 1: Comparison between typical distributional RL algorithms under different distribution divergences and representations of $Z_\theta$. Here $k_\alpha(x, y) = -\|x - y\|^\alpha$ in MMDDRL, $d$ is the sample dimension and $\kappa = 2\beta d + \|c\|_\infty$, where the cost function $c$ is $\beta$-Lipschitz (Genevay et al., 2019). The sample complexity of MMD can be improved to $O(1/n)$ using the kernel herding technique (Chen et al., 2012).

4. SINKHORN DISTRIBUTIONAL RL (SINKHORNDRL)

In this section, we first introduce Sinkhorn divergence and apply it to distributional RL. Next, we conduct a theoretical analysis of the convergence of our algorithm under Sinkhorn divergence and of a new moment-matching interpretation. Finally, a practical Sinkhorn iteration algorithm is introduced to evaluate the Sinkhorn divergence.

4.1. SINKHORN DIVERGENCE AND GENERIC ALGORITHM

We design the Sinkhorn distributional RL algorithm via Sinkhorn divergence. Sinkhorn divergence (Sinkhorn, 1967) is a tractable loss that approximates the optimal transport problem by leveraging an entropic regularization, turning the original Wasserstein distance into a differentiable and more robust quantity. The resulting loss can be computed with Sinkhorn fixed-point iterations, which are naturally suitable for modern deep learning frameworks. In particular, the entropic smoothing generates a family of losses interpolating between the Wasserstein distance and MMD. As such, it allows us to find a sweet trade-off that simultaneously leverages the geometry of the Wasserstein distance on the one hand, and the favorable high-dimensional sample complexity and unbiased gradient estimates of MMD on the other.

We introduce the entropic regularized Wasserstein distance $W_{c,\varepsilon}(\mu, \nu)$ as
$$W_{c,\varepsilon}(\mu, \nu) = \min_{\Pi \in \Pi(\mu,\nu)} \int c(x, y)\, d\Pi(x, y) + \varepsilon\, \mathrm{KL}(\Pi \,|\, \mu \otimes \nu),$$
where $\mathrm{KL}(\Pi \,|\, \mu \otimes \nu) = \int \log\big(\frac{d\Pi(x, y)}{d\mu(x)\, d\nu(y)}\big)\, d\Pi(x, y)$ is a strongly convex regularization. The impact of this entropic regularization is similar to that of $\ell_2$ ridge regularization in linear regression. Next, the Sinkhorn loss (Feydy et al., 2019; Genevay et al., 2018) between two measures $\mu$ and $\nu$ is defined as
$$\overline{W}_{c,\varepsilon}(\mu, \nu) = 2W_{c,\varepsilon}(\mu, \nu) - W_{c,\varepsilon}(\mu, \mu) - W_{c,\varepsilon}(\nu, \nu).$$
As demonstrated by (Feydy et al., 2019), the Sinkhorn divergence $\overline{W}_{c,\varepsilon}(\mu, \nu)$ is convex, smooth and positive definite, and it metrizes convergence in law. In statistical physics, $W_{c,\varepsilon}(\mu, \nu)$ can be re-factored as a projection problem
$$W_{c,\varepsilon}(\mu, \nu) := \min_{\Pi \in \Pi(\mu,\nu)} \varepsilon\, \mathrm{KL}(\Pi \,|\, K),$$
where $K$ is the Gibbs distribution whose density satisfies $dK(x, y) = e^{-c(x, y)/\varepsilon}\, d\mu(x)\, d\nu(y)$. This problem is often referred to as the "static Schrödinger problem" (Léonard, 2013; Rüschendorf & Thomsen, 1998), as it was initially considered in statistical physics.

Distributional RL with Sinkhorn Divergence and Particle Representation.
The key to applying Sinkhorn divergence in distributional RL is to leverage the Sinkhorn loss $\overline{W}_{c,\varepsilon}$ to measure the distance between the current action-value distribution $Z_\theta(s, a)$ and the target distribution $T^\pi Z_\theta(s, a)$, yielding $\overline{W}_{c,\varepsilon}(Z_\theta(s, a), T^\pi Z_\theta(s, a))$ for each $(s, a)$ pair. For the representation of $Z_\theta(s, a)$, we employ unrestricted statistics, i.e., deterministic samples, due to their superiority in MMDDRL (Nguyen et al., 2020), instead of predefined statistic functionals, e.g., quantiles in QR-DQN (Dabney et al., 2018b) or the categorical distribution in C51 (Bellemare et al., 2017a). More concretely, we use neural networks to generate samples that approximate the value distribution. This can be expressed as $Z_\theta(s, a) := \{Z_\theta(s, a)_i\}_{i=1}^N$, where $N$ is the number of generated samples. We refer to the samples $\{Z_\theta(s, a)_i\}_{i=1}^N$ as particles. Then we leverage the Dirac mixture $\frac{1}{N}\sum_{i=1}^N \delta_{Z_\theta(s,a)_i}$ to approximate the true density of $Z^\pi(s, a)$, thus minimizing the Sinkhorn divergence between the approximate distribution and its distributional Bellman target. A detailed and generic distributional RL algorithm with Sinkhorn divergence and particle representation is provided in Algorithm 1.

Algorithm 1 Generic SinkhornDRL update
Input: number of particles $N$, transition $(s, a, r, s')$
1: if policy evaluation then
2:   $a^* \sim \pi(\cdot|s')$
3: else
4:   $a^* \leftarrow \arg\max_{a' \in \mathcal{A}} \frac{1}{N}\sum_{i=1}^N Z_\theta(s', a')_i$
5: end if
6: $TZ_i \leftarrow r + \gamma Z_{\theta^*}(s', a^*)_i, \ \forall 1 \le i \le N$
Output: $\overline{W}_{c,\varepsilon}\big(\{Z_\theta(s, a)_i\}_{i=1}^N, \{TZ_j\}_{j=1}^N\big)$

Remark. From the general framework in Algorithm 1, SinkhornDRL only modifies the distribution divergence compared with the state-of-the-art MMDDRL (Nguyen et al., 2020), but SinkhornDRL fundamentally falls into the Wasserstein distance-based distributional RL family discussed in Appendix A. As such, QR-DQN and MMDDRL are direct counterparts of SinkhornDRL, and the follow-up works IQN (Dabney et al., 2018a) and FQF (Yang et al., 2019) can naturally extend both MMDDRL and SinkhornDRL, as discussed in (Nguyen et al., 2020).
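The greedy target construction of Algorithm 1 can be sketched in a few lines of numpy; the array shapes and names below are our own assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def bellman_target_particles(z_target, r, gamma):
    """Greedy distributional Bellman target for a particle representation.

    z_target: array of shape (num_actions, N), particles Z_{theta*}(s', a')_i.
    Returns the N target particles TZ_i = r + gamma * Z_{theta*}(s', a*)_i.
    """
    q_values = z_target.mean(axis=1)      # Q(s', a') = (1/N) * sum_i of particles
    a_star = int(np.argmax(q_values))     # greedy action a*
    return r + gamma * z_target[a_star]   # shift and scale every particle

rng = np.random.default_rng(0)
z_next = rng.normal(loc=[[0.0], [2.0]], scale=1.0, size=(2, 5))  # 2 actions, N = 5
tz = bellman_target_particles(z_next, r=1.0, gamma=0.99)         # 5 target particles
```

The Sinkhorn loss between `tz` and the current particles then serves as the training objective, with gradients flowing only into the current network's particles.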

4.2. THEORETICAL ANALYSIS UNDER SINKHORN DIVERGENCE

Convergence. First, we define the supremal form of Sinkhorn divergence, $\overline{W}^\infty_{c,\varepsilon}(\mu, \nu)$:
$$\overline{W}^\infty_{c,\varepsilon}(\mu, \nu) = \sup_{(s, a) \in \mathcal{S} \times \mathcal{A}} \overline{W}_{c,\varepsilon}(\mu(s, a), \nu(s, a)).$$
We will use $\overline{W}^\infty_{c,\varepsilon}(\mu, \nu)$ to establish the convergence of $T^\pi$ in Theorem 1.

Theorem 1. If we leverage the Sinkhorn loss $\overline{W}_{c,\varepsilon}(\mu, \nu)$ in Eq. 8 as the distribution divergence in distributional RL, and choose the unrectified kernel $k_\alpha(x, y) := -\|x - y\|^\alpha$ as $-c$ ($\alpha > 0$), it holds that:
(1) ($\varepsilon \to 0$) $\overline{W}_{c,\varepsilon}(\mu, \nu) \to 2W_\alpha(\mu, \nu)$. When $\varepsilon = 0$, $T^\pi$ is a $\gamma$-contraction under $\overline{W}^\infty_{c,\varepsilon}$.
(2) ($\varepsilon \to +\infty$) $\overline{W}_{c,\varepsilon}(\mu, \nu) \to \mathrm{MMD}^2_{k_\alpha}(\mu, \nu)$. When $\varepsilon = +\infty$, $T^\pi$ is $\gamma^{\alpha/2}$-contractive under $\overline{W}^\infty_{c,\varepsilon}$.
(3) ($\varepsilon \in (0, +\infty)$) $T^\pi$ is a contractive operator under $\overline{W}^\infty_{c,\varepsilon}$. The related non-constant contraction factor $\Delta(\gamma, \alpha) < 1$ also depends on the distribution sequence in the distributional Bellman iterations.

We provide the long yet rigorous proof of Theorem 1 in Appendix C. Theorem 1 (1) and (2) are follow-up conclusions regarding the convergence behavior of $T^\pi$, based on the interpolation of Sinkhorn divergence between the Wasserstein distance and MMD (Genevay et al., 2018). Our key theoretical contribution is for the general $\varepsilon \in (0, \infty)$, for which we conclude that $T^\pi$ is a contractive operator. The crux of the proof is two-fold. First, we show a variant of the scale-sensitive property of Sinkhorn divergence when $c = -k_\alpha$, where the resulting non-constant scaling factor is also determined by the two specified probability measures. Next, we propose a new distributional contraction mapping theorem in Theorem 2 of Appendix C, based on which we eventually arrive at the convergence of the distributional Bellman operator under $\overline{W}^\infty_{c,\varepsilon}$. Intriguingly yet reasonably, the contraction factor $\Delta(\gamma, \alpha)$ is not a constant but a function less than 1 that also depends on the distribution sequence produced while iteratively applying distributional Bellman updates. Our non-trivial proof about Sinkhorn divergence may also be of interest to the optimal transport literature.
Consistency with Related Conclusions. As Sinkhorn divergence interpolates between the Wasserstein distance and MMD, its contraction property when the cost function satisfies $c = -k_\alpha$ for general $\varepsilon \in [0, \infty]$ is intuitive. Note that if we choose a Gaussian kernel as the cost function, there is no concise and consistent contraction result like Theorem 1 (3). This is consistent with MMDDRL (Nguyen et al., 2020), where $T^\pi$ is in general not a contraction under MMD equipped with Gaussian kernels, as a counterexample was pointed out in MMDDRL (corresponding to $\varepsilon \to +\infty$). To be consistent with the contraction property analyzed in our theory (Theorem 1 (3)), we employ the unrectified kernel $k_\alpha$ as the cost function in our experiments and set $\alpha = 2$, under which SinkhornDRL exhibits favorable performance in Section 5.

Regularized Moment Matching under Sinkhorn Divergence with Gaussian Kernels. We further examine the potential connection between SinkhornDRL and existing distributional RL families. Inspired by the analogous analysis in MMDDRL (Nguyen et al., 2020), we find that the Sinkhorn divergence with a Gaussian kernel also promotes matching all moments between two distributions. More specifically, the Sinkhorn divergence can be rewritten in a regularized moment-matching form in Proposition 1.

Proposition 1. For $\varepsilon \in (0, +\infty)$, the Sinkhorn divergence $\overline{W}_{c,\varepsilon}(\mu, \nu)$ associated with the Gaussian kernel $k(x, y) = \exp(-(x - y)^2/(2\sigma^2))$ as $-c$ is equivalent to
$$\overline{W}_{c,\varepsilon}(\mu, \nu) := \sum_{n=0}^{\infty} \frac{1}{\sigma^{2n} n!} \big(\tilde{M}_n(\mu) - \tilde{M}_n(\nu)\big)^2 + \varepsilon\, \mathbb{E}\left[\log \frac{\big(\Pi^*_\varepsilon(X, Y)\big)^2}{\Pi^*_\varepsilon(X, X')\, \Pi^*_\varepsilon(Y, Y')}\right],$$
where $\Pi^*_\varepsilon$ denotes the optimal $\Pi$, determined by $\varepsilon$, when evaluating the Sinkhorn divergence via $\min_{\Pi \in \Pi(\mu,\nu)} W_{c,\varepsilon}(\mu, \nu)$, and $\tilde{M}_n(\mu) = \mathbb{E}_{x \sim \mu}\big[e^{-x^2/(2\sigma^2)} x^n\big]$, and similarly for $\tilde{M}_n(\nu)$. We provide the proof of Proposition 1 in Appendix D.
Similar to MMDDRL with a Gaussian kernel (Nguyen et al., 2020), Sinkhorn divergence approximately performs a regularized moment matching scaled by $e^{-x^2/(2\sigma^2)}$.

Equivalence to Regularized MMD Distributional RL. Based on Proposition 1, we can immediately establish the connection between Sinkhorn divergence and MMD in Corollary 1, indicating that minimizing the Sinkhorn divergence between two distributions is equivalent to minimizing a regularized squared MMD.

Corollary 1. For $\varepsilon \in (0, +\infty)$, denoting by $\Pi^*_\varepsilon$ the optimal $\Pi$ obtained when evaluating the Sinkhorn divergence, it holds that
$$\overline{W}_{c,\varepsilon} := \mathrm{MMD}^2_{-c}(\mu, \nu) + \varepsilon\, \mathbb{E}\left[\log \frac{\big(\Pi^*_\varepsilon(X, Y)\big)^2}{\Pi^*_\varepsilon(X, X')\, \Pi^*_\varepsilon(Y, Y')}\right],$$
where we write $\overline{W}_{c,\varepsilon}$ for $\overline{W}_{c,\varepsilon}(\mu, \nu)$ for short. The proof of Corollary 1 is provided in Appendix D. It is worth noting that this equivalence is established for the general case $\varepsilon \in (0, +\infty)$; it does not hold in the limit cases $\varepsilon \to 0$ or $+\infty$. For example, when $\varepsilon \to +\infty$, the second, $\varepsilon$-dependent part of Eq. 12 does not dominate, because the regularization term tends to 0 as $\Pi^*_\varepsilon \to \mu \otimes \nu$ when $\varepsilon \to +\infty$. In summary, even though the Sinkhorn divergence was initially proposed as an entropy-regularized Wasserstein distance with the cost function $c = \|x - y\|^\alpha$, it turns out to be equivalent to a regularized MMD when associated with Gaussian kernels, as revealed in Corollary 1.
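The $\varepsilon$-free part of Proposition 1, i.e., the scaled-moment expansion of the Gaussian-kernel squared MMD, can be checked numerically. A numpy sketch using biased (V-statistic) estimators on toy samples (data and function names are our own illustration); it relies on the Taylor expansion $e^{xy/\sigma^2} = \sum_n (xy)^n/(\sigma^{2n} n!)$ applied to the factored Gaussian kernel:

```python
import math
import numpy as np

def mmd2_gauss(x, y, sigma):
    """Biased (V-statistic) squared MMD with Gaussian kernel exp(-(a-b)^2 / (2 sigma^2))."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def mmd2_via_moments(x, y, sigma, n_terms=40):
    """Same quantity via the scaled-moment expansion of Proposition 1:
    sum_n (M_n(mu) - M_n(nu))^2 / (sigma^(2n) n!), with
    M_n(mu) = E[exp(-X^2 / (2 sigma^2)) X^n] under the empirical measure."""
    m = lambda z, n: np.mean(np.exp(-z**2 / (2 * sigma**2)) * z**n)
    return sum((m(x, n) - m(y, n)) ** 2 / (sigma ** (2 * n) * math.factorial(n))
               for n in range(n_terms))

rng = np.random.default_rng(1)
x, y = rng.normal(0.0, 1.0, 8), rng.normal(0.5, 1.2, 8)
direct = mmd2_gauss(x, y, sigma=2.0)
expanded = mmd2_via_moments(x, y, sigma=2.0)   # agrees up to truncation error
```

The two quantities agree to floating-point precision, which is exactly the moment-matching reading of the MMD part of the Sinkhorn objective.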

4.3. DISTRIBUTIONAL RL VIA SINKHORN ITERATIONS

The theoretical analysis in Section 4.2 sheds light on the behavior of distributional RL with Sinkhorn divergence, but another crucial issue is how to evaluate the Sinkhorn loss effectively. Since Sinkhorn divergence enjoys both the geometric properties of optimal transport and the computational effectiveness of MMD, we can utilize Sinkhorn's algorithm, i.e., Sinkhorn iterations (Sinkhorn, 1967; Genevay et al., 2018), to evaluate the Sinkhorn loss. Notably, Sinkhorn iteration with $L$ steps yields a differentiable loss function that can be solved efficiently, as its main burden is matrix-vector multiplication; this streams well on GPUs by simply adding extra differentiable layers on top of a typical deep neural network, such as a DQN architecture.

Specifically, given two sample sequences $\{Z_i\}_{i=1}^N$ and $\{TZ_j\}_{j=1}^N$ in the distributional RL algorithm, the (empirical) optimal transport distance is equivalent to the linear program
$$\min_{P \in \mathbb{R}_+^{N \times N}} \langle P, \hat{c} \rangle \quad \text{s.t.} \quad P \mathbf{1}_N = \tfrac{1}{N}\mathbf{1}_N, \quad P^\top \mathbf{1}_N = \tfrac{1}{N}\mathbf{1}_N,$$
where the empirical cost matrix is $\hat{c}_{i,j} = c(Z_i, TZ_j)$. Adding the entropic regularization to the optimal transport distance restricts the search space of $P$ to the scaling form $P_{i,j} = a_i K_{i,j} b_j$, where $K_{i,j} = e^{-\hat{c}_{i,j}/\varepsilon}$ is the Gibbs kernel defined in Eq. 9. This allows us to solve for the vectors $a$ and $b$ by fixed-point iterations. More specifically, we initialize $b^0 = \mathbf{1}_N$, and the Sinkhorn iterations are expressed as
$$a^{l+1} \leftarrow \frac{\mathbf{1}_N / N}{K b^l}, \qquad b^{l+1} \leftarrow \frac{\mathbf{1}_N / N}{K^\top a^{l+1}},$$
where the division is entry-wise. It has been proven that the Sinkhorn iteration converges asymptotically to the true loss at a linear rate (Genevay et al., 2018; Franklin & Lorenz, 1989; Cuturi, 2013; Altschuler et al., 2017). We provide a detailed description in Algorithm 2.

Algorithm 2 Sinkhorn Iterations to Approximate $W_{c,\varepsilon}\big(\{Z_i\}_{i=1}^N, \{TZ_j\}_{j=1}^N\big)$
Input: two sample sequences $\{Z_i\}_{i=1}^N$ and $\{TZ_j\}_{j=1}^N$, number of Sinkhorn iterations $L$, hyperparameter $\varepsilon$.
1: $\hat{c}_{i,j} = c(Z_i, TZ_j)$ for all $i = 1, \ldots, N$, $j = 1, \ldots, N$
2: $K_{i,j} = \exp(-\hat{c}_{i,j}/\varepsilon)$
3: $b^0 \leftarrow \mathbf{1}_N$
4: for $l = 1, 2, \ldots, L$ do
5:   $a^l \leftarrow \dfrac{\mathbf{1}_N / N}{K b^{l-1}}$, $\quad b^l \leftarrow \dfrac{\mathbf{1}_N / N}{K^\top a^l}$
6: end for
7: $\widehat{W}_{c,\varepsilon} = \big\langle a^L, (K \odot \hat{c})\, b^L \big\rangle$
Return: $\widehat{W}_{c,\varepsilon}\big(\{Z_i\}_{i=1}^N, \{TZ_j\}_{j=1}^N\big)$

With the efficient and differentiable Sinkhorn iterations, we can easily evaluate the Sinkhorn divergence and thus let our algorithm enjoy its theoretical advantages. In practice, we need to choose $L$ and $\varepsilon$, and we conduct a sensitivity analysis in Section 5.
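Algorithm 2 amounts to a few lines of numpy. Below is a self-contained sketch (the names and toy particle sets are ours; a practical implementation would operate batch-wise on GPU tensors and may need log-domain stabilization for small $\varepsilon$):

```python
import numpy as np

def sinkhorn_loss(z, tz, eps=10.0, n_iters=10, alpha=2.0):
    """Approximate W_{c,eps} between two 1-D particle sets via Sinkhorn iterations,
    with cost c(x, y) = |x - y|^alpha (a numpy sketch of Algorithm 2)."""
    n = len(z)
    c_hat = np.abs(z[:, None] - tz[None, :]) ** alpha  # empirical cost matrix
    k_mat = np.exp(-c_hat / eps)                       # Gibbs kernel K
    b = np.ones(n)
    for _ in range(n_iters):                           # fixed-point scaling updates
        a = (np.ones(n) / n) / (k_mat @ b)             # entry-wise division
        b = (np.ones(n) / n) / (k_mat.T @ a)
    # <P, c_hat> with P = diag(a) K diag(b)
    return a @ ((k_mat * c_hat) @ b)

z = np.array([0.0, 1.0, 2.0])
tz = z + 1.0  # unit shift: exact OT cost is 1.0; independent-coupling cost is 21/9
loss = sinkhorn_loss(z, tz, eps=1.0, n_iters=100)  # lies between the two extremes
```

The debiased Sinkhorn divergence $\overline{W}_{c,\varepsilon}$ used as the training objective subtracts the two self-comparison terms, i.e., three calls of this routine; as $\varepsilon$ grows, the entropic plan approaches the independent coupling $\mu \otimes \nu$.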

5. EXPERIMENTS

We demonstrate the effectiveness of SinkhornDRL as described in Algorithm 1 on the full suite of 55 Atari 2600 games. Specifically, we leverage the same architecture as QR-DQN (Dabney et al., 2018b) and replace the quantile outputs with $N$ particles, i.e., samples. In contrast to MMDDRL, SinkhornDRL only changes the distribution divergence from MMD to Sinkhorn divergence, so any potential superiority in performance can be attributed to the advantages of Sinkhorn divergence.

Baselines. Due to the interpolation of Sinkhorn divergence between the Wasserstein distance and MMD, we choose three typical distributional RL algorithms as classic baselines, including QR-DQN (Dabney et al., 2018b), which approximates the Wasserstein distance, C51 (Bellemare et al., 2017a) and MMDDRL (Nguyen et al., 2020), as well as DQN (Mnih et al., 2015). MMDDRL is implemented with the same architecture as QR-DQN and leverages Gaussian kernels $k_h(x, y) = \exp(-(x - y)^2/h)$ with the kernel mixture trick covering a range of bandwidths $h$, the same as the basic setting in the original MMDDQN paper (Nguyen et al., 2020). We deploy all algorithms on 55 Atari 2600 games, and reported results are averaged over 3 seeds, with the shade indicating the standard deviation. We run 10M time steps (40M frames) for computational cost reasons, but we report learning curves across all games to make the results convincing.

Hyperparameter settings. For a fair comparison with QR-DQN, C51 and MMDDRL, we use the same hyperparameters: the number of generated samples $N = 200$ and the Adam optimizer with learning rate 0.00005 and $\epsilon_{\mathrm{Adam}} = 0.01/32$. We use a target network to compute the distributional Bellman target, which fits well into the Neural Fitted Z-Iteration framework. In addition, we choose the number of Sinkhorn iterations $L = 10$ and the smoothing hyperparameter $\varepsilon = 10.0$ in Section 5.1, as the results are not sensitive to them within a proper interval, as demonstrated in Section 5.2.
We choose the unrectified kernel as the cost function, i.e., $-c = k_\alpha$, and select $\alpha = 2$ in our SinkhornDRL algorithm.

5.1. PERFORMANCE OF SINKHORNDRL

Figure 1 illustrates that SinkhornDRL achieves competitive performance across 55 Atari games compared with the baseline algorithms, which use different metrics $d_p$ and representations of $Z_\theta$. On a large number of games, e.g., Tennis, Seaquest and Atlantis, SinkhornDRL significantly outperforms the other baselines, especially on Tennis, where the other algorithms even fail to converge. The improvement of SinkhornDRL over MMDDRL empirically verifies the regularization advantage of Sinkhorn divergence as analyzed in Corollary 1. On some games, e.g., Breakout, Pong and SpaceInvaders, SinkhornDRL is on par with MMDDRL and the other baselines, while on the games in the last row of Figure 1, SinkhornDRL is slightly inferior to the state-of-the-art algorithm. We provide learning curves of all typical distributional RL algorithms on all 55 Atari games in Appendix F, where SinkhornDRL still achieves competitive performance in general. We also conduct a ratio-improvement comparison across the 55 Atari games between SinkhornDRL and QR-DQN and MMDDRL, respectively. Figure 2 shows that, compared with QR-DQN (left), SinkhornDRL achieves better performance on almost half of the considered games, and its superiority is significant on a large number of them, including Venture, Seaquest, Tennis and Phoenix. This empirical outperformance verifies the effectiveness of smoothing the Wasserstein distance in distributional RL. In contrast with MMDDRL, the advantage of SinkhornDRL is reduced, with performance improvements on a smaller proportion of games, but a remarkable improvement for SinkhornDRL can still be observed on a large number of games. We also report the mean and median of best human-normalized scores in Table 2 of Appendix E, where SinkhornDRL on average achieves nearly the same state-of-the-art performance as MMDDRL.
Therefore, we conclude that SinkhornDRL is competitive with state-of-the-art distributional RL algorithms, e.g., MMDDRL, and can be markedly superior to existing algorithms on a large proportion of games. This empirical success can be attributed to the theoretical advantage of Sinkhorn divergence, which simultaneously makes full use of the data geometry from the Wasserstein distance and the unbiased gradient estimates from MMD, coinciding with the results in Theorem 1.

5.2. SENSITIVITY ANALYSIS AND COMPUTATIONAL COST

The limit-behavior connection in Theorem 1 (1) and (2) between SinkhornDRL and QR-DQN / MMDDRL may not be rigorously verified in numerical experiments, as an overly large or small $\varepsilon$ leads to numerical instability of the Sinkhorn iterations in Algorithm 2, worsening performance, as shown in Figure 3(a). In practice, we choose $\varepsilon = 10$ across all games. SinkhornDRL also requires proper numbers of iterations $L$ and samples $N$. For example, a small $N$, e.g., $N = 2$ on Seaquest in Figure 3(b), leads to divergence of the algorithm, while an overly large $N$ can degrade performance and meanwhile increases the computational burden (Appendix G). We conjecture that using larger networks to generate more samples is more likely to suffer from overfitting, yielding instability in RL training (Bjorck et al., 2021). Therefore, we choose $N = 200$ to attain appealing performance with computational effectiveness. Regarding computation cost (Appendix G), SinkhornDRL increases the computation cost by around 50% compared with QR-DQN and C51, but only slightly increases the overhead (by around 20%) relative to MMDDRL. Please refer to Appendix G for more detailed results and discussion.

6. DISCUSSIONS AND CONCLUSION

To extend our algorithm for better performance, implicit generative models, including parameterizing the cost function in the Sinkhorn loss, can be further incorporated; we leave this as future work. Moreover, other divergences, e.g., those that also smooth the Wasserstein distance, can be applied to the design of distributional RL algorithms in the future. In this paper, a novel family of distributional RL algorithms based on Sinkhorn divergence is proposed that achieves competitive performance compared with state-of-the-art distributional RL algorithms on 55 Atari games. Theoretical analyses of the convergence and moment-matching behavior are provided, along with a rigorous empirical verification. We expect Sinkhorn distributional RL to make an important contribution to the research community.

A RELATED WORK

Based on the choice of distribution divergence and the representation of $Z_\theta$, distributional RL algorithms can be mainly categorized into three classes: categorical, Wasserstein distance and MMD distributional RL. Finally, we discuss their relationships with our proposed SinkhornDRL.

Categorical Distributional RL. As the first successful distributional RL family, categorical distributional RL (Bellemare et al., 2017a) represents the value distribution $\eta$ by the categorical distribution $\eta = \sum_{i=1}^N p_i \delta_{z_i}$, where $\{z_i\}_{i=1}^N$ ($z_1 < \ldots < z_N$) are fixed supports within a pre-specified interval $[l, u]$ and $p_i$ is the approximated categorical probability of each bin. Within this family, C51 (Bellemare et al., 2017a) leverages a neural network to approximate the categorical probabilities $p_i$ and applies a projected KL divergence between the target and current categorical value distributions. C51 has also been shown to contract under the Cramér distance (Bellemare et al., 2017b; Rowland et al., 2018) and empirically performs favorably on the suite of Atari games.

Wasserstein Distance Distributional RL. As directly solving the Wasserstein distance in Eq. 16 is tricky, QR-DQN (Dabney et al., 2018b) first proposed to use quantile regression to approximate the Wasserstein distance $W_p$. QR-DQN leverages quantiles to represent the distribution $\eta$ of $Z_\theta$, i.e., $\eta = \frac{1}{N}\sum_{i=1}^N \delta_{z_i}$, where $\{z_i\}_{i=1}^N$ are learnable support atoms, namely the quantile values at the fixed quantile levels $\{\frac{2i-1}{2N}\}_{i=1}^N$. Implicit Quantile Networks (IQN) (Dabney et al., 2018a) utilize an implicit model to output the quantile values $\{z_i\}_{i=1}^N$ more expressively, instead of the fixed ones in QR-DQN; IQN also incorporates risk measures into the distributional RL framework. A follow-up work, Fully parameterized Quantile Function (FQF) (Yang et al., 2019), improves IQN by proposing a more expressive quantile network, achieving better performance on Atari games.
The non-crossing issue in quantile regression has been raised and properly addressed in (Zhou et al., 2020), which further improves QR-DQN. Monotonic rational-quadratic splines have also been used to learn smooth continuous quantile functions (Luo et al., 2021).

Discussion of SinkhornDRL. As a complementary Wasserstein distance-based distributional RL method, our SinkhornDRL solves the Wasserstein distance by incorporating an entropic regularization and intrinsically circumvents the non-crossing issue of quantile regression. Moreover, the cost function in SinkhornDRL can be further parameterized, similar to IQN and FQF, which can intuitively achieve better performance; we leave the investigation of this direction as future work. Meanwhile, SinkhornDRL is also closely linked with MMDDRL, as Sinkhorn divergence interpolates between the Wasserstein distance and MMD, and SinkhornDRL likewise learns unrestricted statistics, i.e., samples, akin to MMDDRL.

B DEFINITION OF DISTANCES AND CONTRACTION

Definition of distances. Given two random variables $X$ and $Y$, the $p$-Wasserstein metric $W_p$ between the distributions of $X$ and $Y$ is defined as $W_p(X, Y) = \big(\int_0^1 |F_X^{-1}(\omega) - F_Y^{-1}(\omega)|^p\, d\omega\big)^{1/p} = \|F_X^{-1} - F_Y^{-1}\|_p$, where $F^{-1}$ is the inverse cumulative distribution function of a random variable with cumulative distribution function $F$. Further, the $\ell_p$ distance (Elie & Arthur, 2020) is defined as $\ell_p(X, Y) := \big(\int_{-\infty}^{\infty} |F_X(\omega) - F_Y(\omega)|^p\, d\omega\big)^{1/p} = \|F_X - F_Y\|_p$. The $\ell_p$ distance and the Wasserstein metric are identical at $p = 1$, but are otherwise distinct. Note that at $p = 2$ the $\ell_p$ distance is also called the Cramér distance (Bellemare et al., 2017b), $d_C(X, Y)$. The squared Cramér distance also has a different representation, given by $d_C^2(X, Y) = \mathbb{E}|X - Y| - \frac{1}{2}\mathbb{E}|X - X'| - \frac{1}{2}\mathbb{E}|Y - Y'|$, where $X'$ and $Y'$ are i.i.d. copies of $X$ and $Y$. The energy distance (Székely, 2003; Ziel, 2020) is a natural extension of the Cramér distance to the multivariate case, defined as $d_E^2(\mathbf{X}, \mathbf{Y}) = \mathbb{E}\|\mathbf{X} - \mathbf{Y}\| - \frac{1}{2}\mathbb{E}\|\mathbf{X} - \mathbf{X}'\| - \frac{1}{2}\mathbb{E}\|\mathbf{Y} - \mathbf{Y}'\|$, where $\mathbf{X}$ and $\mathbf{Y}$ are multivariate. Moreover, the energy distance is a special case of the maximum mean discrepancy (MMD), which is formulated as
$$\mathrm{MMD}(X, Y; k) = \big(\mathbb{E}[k(X, X')] + \mathbb{E}[k(Y, Y')] - 2\mathbb{E}[k(X, Y)]\big)^{1/2}, \quad (20)$$
where $k(\cdot,\cdot)$ is a continuous kernel on $\mathcal{X}$. In particular, if $k$ is a trivial kernel, MMD degenerates to the energy distance. Additionally, we define the supremal MMD, a functional $\mathcal{P}(\mathcal{X})^{\mathcal{S}\times\mathcal{A}} \times \mathcal{P}(\mathcal{X})^{\mathcal{S}\times\mathcal{A}} \to \mathbb{R}$, as $\mathrm{MMD}_\infty(\mu, \nu) = \sup_{(s, a) \in \mathcal{S}\times\mathcal{A}} \mathrm{MMD}(\mu(s, a), \nu(s, a))$.

We further present the convergence rates under different distribution divergences:
• $T^\pi$ is $\gamma$-contractive under the supremal form of the Wasserstein distance $W_p$.
• $T^\pi$ is $\gamma^{1/p}$-contractive under the supremal form of the $\ell_p$ distance.
• $T^\pi$ is $\gamma^{\alpha/2}$-contractive under $\mathrm{MMD}_\infty$ with the kernel $k_\alpha(x, y) = -\|x - y\|^\alpha$, $\forall \alpha > 0$.

Proof of Contraction.
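The CDF-based and expectation-based representations of the Cramér distance above can be checked numerically on small empirical distributions. A numpy sketch (the toy data and function names are our own illustration):

```python
import numpy as np

def cramer2_cdf(x, y, grid):
    """Squared Cramer distance via CDFs: integral of (F_X - F_Y)^2 over a fine grid."""
    fx = (x[None, :] <= grid[:, None]).mean(axis=1)  # empirical CDF of X on the grid
    fy = (y[None, :] <= grid[:, None]).mean(axis=1)
    dw = grid[1] - grid[0]
    return np.sum((fx - fy) ** 2) * dw               # Riemann sum approximation

def cramer2_expect(x, y):
    """Same quantity via E|X-Y| - E|X-X'|/2 - E|Y-Y'|/2 over the empirical measures."""
    d = lambda a, b: np.abs(a[:, None] - b[None, :]).mean()
    return d(x, y) - 0.5 * d(x, x) - 0.5 * d(y, y)

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5, 2.5])
grid = np.arange(-1.0, 4.0, 0.001)   # covers both supports
lhs = cramer2_cdf(x, y, grid)        # CDF form, up to grid discretization error
rhs = cramer2_expect(x, y)           # expectation form, exact here: 1/6
```

The expectation form is what sample-based distributional RL losses estimate directly, since it avoids any explicit CDF.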

C PROOF OF THEOREM 1

Proof. 1. $\varepsilon \to 0$ and $c = -k_\alpha$. It is straightforward to observe that the Sinkhorn loss degenerates to the Wasserstein distance. We also have the conclusion that the distributional Bellman operator $\mathcal{T}^\pi$ is $\gamma$-contractive under the supremum form of the Wasserstein distance; the proof is provided in Lemma 3 of (Bellemare et al., 2017a). Since this conclusion is based directly on the limiting case $\varepsilon = 0$, an unspecified $\varepsilon$ requires a more rigorous argument. We show that the distance difference is at most an infinitesimal $\delta$. Firstly, since $\overline{W}_{c,\varepsilon} \to W_\alpha$ and the regularization term is non-negative, in the language of the $(\varepsilon, \delta)$ definition: for all $\delta > 0$, there exists a small positive constant $a$ such that $\overline{W}_{c,\varepsilon} - W_\alpha < \delta$ when $\varepsilon \leq a$. Based on this, we obtain the contraction conclusion:
$$\overline{W}^\infty_{-k_\alpha,\varepsilon}(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) = \overline{W}^\infty_{-k_\alpha,\varepsilon}(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) - W^\infty_\alpha(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) + W^\infty_\alpha(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \leq \delta + W^\infty_\alpha(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2),$$
where the second term $W^\infty_\alpha(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2)$ is contractive. Thus, for an unspecified $\varepsilon$, the only difference from the limiting case $\varepsilon = 0$ is an infinitesimal $\delta$, which vanishes as $\varepsilon \to 0$ (or $a \to 0$).

2. $\varepsilon \to \infty$. Our proof is inspired by (Ramdas et al., 2017; Genevay et al., 2018). Recall the definition of the squared MMD:
$$\mathbb{E}[k(X, X')] + \mathbb{E}[k(Y, Y')] - 2\mathbb{E}[k(X, Y)].$$
When the kernel $k$ is the unrectified kernel $k_\alpha(x, y) := -\|x - y\|^\alpha$ for $\alpha \in (0, 2)$, the squared MMD degenerates to
$$2\mathbb{E}\|X - Y\|^\alpha - \mathbb{E}\|X - X'\|^\alpha - \mathbb{E}\|Y - Y'\|^\alpha.$$
On the other hand, the Sinkhorn loss satisfies
$$\overline{W}_{c,\infty}(\mu, \nu) = 2W_{c,\infty}(\mu, \nu) - W_{c,\infty}(\mu, \mu) - W_{c,\infty}(\nu, \nu).$$
Let $\Pi_\varepsilon$ denote the unique minimizer of $W_{c,\varepsilon}$; it holds that $\Pi_\varepsilon \to \mu \otimes \nu$ as $\varepsilon \to \infty$. That being said,
$$W_{c,\infty}(\mu, \nu) \to \int c(x, y)\, d\mu(x)\, d\nu(y) + 0 = \int c(x, y)\, d\mu(x)\, d\nu(y).$$
If $c = -k_\alpha = \|x - y\|^\alpha$, we eventually have
$$W_{-k_\alpha,\infty}(\mu, \nu) \to \int \|x - y\|^\alpha\, d\mu(x)\, d\nu(y) = \mathbb{E}\|X - Y\|^\alpha.$$
Finally, we obtain
$$\overline{W}_{-k_\alpha,\infty} \to 2\mathbb{E}\|X - Y\|^\alpha - \mathbb{E}\|X - X'\|^\alpha - \mathbb{E}\|Y - Y'\|^\alpha,$$
which is exactly the form of the squared MMD. The key remaining step is to prove that $\Pi_\varepsilon \to \mu \otimes \nu$ as $\varepsilon \to \infty$.
Firstly, it is apparent that $W_{c,\varepsilon}(\mu, \nu) \leq \int c(x, y)\, d\mu(x)\, d\nu(y)$ since $\mu \otimes \nu \in \Pi(\mu, \nu)$. Let $\{\varepsilon_k\}$ be a positive sequence diverging to $\infty$, and let $\Pi_k$ be the corresponding sequence of minimizers of $W_{c,\varepsilon_k}$. By the optimality condition, it must be the case that
$$\int c(x, y)\, d\Pi_k + \varepsilon_k \text{KL}(\Pi_k \,\|\, \mu \otimes \nu) \leq \int c(x, y)\, d\mu \otimes \nu + 0,$$
where the right-hand side corresponds to the feasible coupling $\mu \otimes \nu$. Thus,
$$\text{KL}(\Pi_k \,\|\, \mu \otimes \nu) \leq \frac{1}{\varepsilon_k}\left(\int c\, d\mu \otimes \nu - \int c\, d\Pi_k\right) \to 0.$$
Besides, by the compactness of $\Pi(\mu, \nu)$, we can extract a converging subsequence $\Pi_{n_k} \to \Pi_\infty$. Since KL is weakly lower-semicontinuous, it holds that
$$\text{KL}(\Pi_\infty \,\|\, \mu \otimes \nu) \leq \liminf_{k \to \infty} \text{KL}(\Pi_{n_k} \,\|\, \mu \otimes \nu) = 0.$$
Hence $\Pi_\infty = \mu \otimes \nu$. That is, the optimal coupling is simply the product of the marginals, which shows that $\Pi_\varepsilon \to \mu \otimes \nu$ as $\varepsilon \to \infty$. As a special case, when $\alpha = 1$, $\overline{W}_{-k_1,\infty}(\mu, \nu)$ is equivalent to the energy distance $d_E(X, Y) := 2\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\|$. In summary, if the cost function is the unrectified kernel $k_\alpha$, then $\overline{W}_{-k_\alpha,\varepsilon}$ converges to the squared MMD as $\varepsilon \to \infty$. According to (Nguyen et al., 2020), $\mathcal{T}^\pi$ is $\gamma^{\alpha/2}$-contractive in the supremum form of MMD with the unrectified kernel $k_\alpha$. For an unspecified $\varepsilon$, we obtain a result similar to the case $\varepsilon \to 0$: for all $\delta > 0$, there exists a large positive constant $M$ such that $\text{MMD}^2_{k_\alpha} - \overline{W}_{c,\varepsilon} < \delta$ when $\varepsilon \geq M$. Based on this, we have the contraction conclusion:
$$\overline{W}^\infty_{-k_\alpha,\varepsilon}(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) = \overline{W}^\infty_{-k_\alpha,\varepsilon}(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) - \text{MMD}^2_\infty(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) + \text{MMD}^2_\infty(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \leq \text{MMD}^2_\infty(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) + \delta,$$
where the first term $\text{MMD}^2_\infty(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2)$ is $\gamma^{\alpha/2}$-contractive. Thus, for an unspecified $\varepsilon$, the only difference from the limiting case $\varepsilon = \infty$ is an infinitesimal $\delta$, which vanishes as $\varepsilon \to +\infty$ (or $M \to +\infty$).

3. For $\varepsilon \in (0, +\infty)$, the contraction property requires a longer proof. The pipeline is as follows: we first establish three properties of the Sinkhorn divergence, and then show the contraction of the distributional Bellman operator under the Sinkhorn divergence based on these properties.
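The limiting behavior $\Pi_\varepsilon \to \mu \otimes \nu$ can also be checked numerically on small discrete measures. The sketch below is a minimal Sinkhorn solver for illustration only (the grid, measures, and iteration count are arbitrary choices, not the paper's setup); it shows the entropic optimal coupling approaching the product of the marginals as $\varepsilon$ grows:

```python
import numpy as np

def sinkhorn_plan(mu, nu, C, eps, n_iters=500):
    """Plain Sinkhorn iterations returning the entropic optimal coupling Pi_eps."""
    K = np.exp(-C / eps)          # Gibbs kernel K_ij = exp(-C_ij / eps)
    a = np.ones_like(mu)
    b = np.ones_like(nu)
    for _ in range(n_iters):      # alternating marginal-matching scalings
        b = nu / (K.T @ a)
        a = mu / (K @ b)
    return a[:, None] * K * b[None, :]

# Two small discrete measures on a 1-D grid with cost c(x, y) = (x - y)^2.
grid = np.linspace(0.0, 1.0, 5)
mu = np.ones(5) / 5
nu = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
C = (grid[:, None] - grid[None, :]) ** 2

gaps = []
for eps in (0.01, 1.0, 100.0):
    P = sinkhorn_plan(mu, nu, C, eps)
    gaps.append(np.abs(P - np.outer(mu, nu)).max())
print(gaps)  # max deviation from mu ⊗ nu shrinks as eps grows
```

For small $\varepsilon$ the plan stays close to the unregularized optimal transport plan, while for large $\varepsilon$ it collapses toward $\mu \otimes \nu$, matching the argument above.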
Most importantly, we analyze the contraction under a new non-constant factor.

3.1 Properties of Sinkhorn Divergence. We recap three crucial properties of a divergence metric. The first is scale sensitivity (S) (of order $\beta > 0$), i.e., $d_p(cX, cY) \leq |c|^\beta d_p(X, Y)$. The second is shift invariance (I), i.e., $d_p(A + X, A + Y) \leq d_p(X, Y)$. The last is the unbiased gradient property (U). A key observation for the analysis is that the Sinkhorn divergence degenerates to a two-dimensional KL divergence, and therefore exhibits convergence behavior similar to the KL divergence. Concretely, according to the equivalent form of $W_{c,\varepsilon}(\mu, \nu)$ in Eq. 9, it can be expressed as the KL divergence between an optimal joint distribution and a Gibbs distribution associated with the cost function:
$$W_{c,\varepsilon}(\mu, \nu) := \text{KL}(\Pi^*(\mu, \nu) \,\|\, \mathcal{K}(\mu, \nu)), \qquad (25)$$
where $\Pi^*$ is the optimal joint distribution. Thus, the total Sinkhorn divergence is expressed as
$$\overline{W}_{c,\varepsilon}(\mu, \nu) := 2\text{KL}(\Pi^*(\mu, \nu) \,\|\, \mathcal{K}(\mu, \nu)) - \text{KL}(\Pi^*(\mu, \mu) \,\|\, \mathcal{K}(\mu, \mu)) - \text{KL}(\Pi^*(\nu, \nu) \,\|\, \mathcal{K}(\nu, \nu)).$$
Due to this form of $\overline{W}_{c,\varepsilon}(\mu, \nu)$, its convergence behavior is determined by that of $W_{c,\varepsilon}(\mu, \nu)$, which resembles the behavior of the KL divergence. We therefore focus on the convergence analysis of $W_{c,\varepsilon}(\mu, \nu)$. Since the KL divergence has the unbiased gradient (U) and shift invariance (I) properties, and the Sinkhorn divergence can be viewed as a two-dimensional KL divergence, both U and I extend to the Sinkhorn divergence. However, we find that the scale sensitivity (S) property does not directly transfer to the Sinkhorn divergence, owing to the minimization in $W_{c,\varepsilon}(\mu, \nu)$: the optimal joint distribution $\Pi^*(\mu, \nu)$ may differ from $\Pi^0(a\mu, a\nu)$, where $a$ is the scale factor. We therefore need a new rigorous proof of the scale sensitivity property, as follows.

3.2 Scale Sensitivity Property of Sinkhorn Divergence.
We show that the Sinkhorn divergence satisfies a variant of the scale sensitivity property when $c = -k_\alpha$, with a non-constant scale factor $\Delta_{U,V}(a, \alpha)$ that is a function not only of the vanilla scale factor $a$ and the exponent $\alpha$ in $k_\alpha$, but also of the two specified probability measures $(U, V)$. By definition, the pdf of $\mathcal{K}(U, V)$ satisfies $\mathcal{K}(U, V) \propto e^{-\frac{c(x, y)}{\varepsilon}} \mu(x)\nu(y)$. After a scaling transformation, the pdfs of $aU$ and $aV$ with respect to $x$ and $y$ become $\frac{1}{a}\mu(\frac{x}{a})$ and $\frac{1}{a}\nu(\frac{y}{a})$, so $\mathcal{K}(aU, aV) \propto e^{-\frac{c(x, y)}{\varepsilon}} \frac{1}{a}\mu(\frac{x}{a}) \frac{1}{a}\nu(\frac{y}{a})$. Denote by $\Pi^*$ and $\Pi^0$ the optimal joint distributions of $W_{c,\varepsilon}(\mu, \nu)$ and $W_{c,\varepsilon}(a\mu, a\nu)$, respectively, and let $\Pi^*_a$ be the (feasible) coupling with density $\frac{1}{a^2}\pi^*(\frac{x}{a}, \frac{y}{a})$. Then
$$\begin{aligned}
W_{c,\varepsilon}(aU, aV) &= \int c(x, y)\, d\Pi^0(x, y) + \varepsilon\, \text{KL}(\Pi^0 \,\|\, a\mu \otimes a\nu) \\
&\leq \int c(x, y)\, d\Pi^*_a(x, y) + \varepsilon\, \text{KL}(\Pi^*_a \,\|\, a\mu \otimes a\nu) \\
&\overset{c = -k_\alpha}{=} \iint \|x - y\|^\alpha \frac{1}{a^2} \pi^*\!\left(\tfrac{x}{a}, \tfrac{y}{a}\right) dx\, dy + \varepsilon \iint \frac{1}{a^2} \pi^*\!\left(\tfrac{x}{a}, \tfrac{y}{a}\right) \log \frac{\frac{1}{a^2}\pi^*(\frac{x}{a}, \frac{y}{a})}{\frac{1}{a^2}\mu(\frac{x}{a})\nu(\frac{y}{a})}\, dx\, dy \\
&= |a|^\alpha \iint \|x - y\|^\alpha \pi^*(x, y)\, dx\, dy + \varepsilon \iint \pi^*(x, y) \log \frac{\pi^*(x, y)}{\mu(x)\nu(y)}\, dx\, dy \\
&= \iint \|x - y\|^\alpha \pi^*(x, y)\, dx\, dy + \varepsilon\, \text{KL}(\Pi^* \,\|\, \mu \otimes \nu) - (1 - |a|^\alpha) \iint \|x - y\|^\alpha \pi^*(x, y)\, dx\, dy \\
&= W_{c,\varepsilon}(U, V) - (1 - |a|^\alpha) \int \|x - y\|^\alpha\, d\Pi^*(x, y) \\
&= \Delta_{U,V}(a, \alpha)\, W_{c,\varepsilon}(U, V), \qquad (27)
\end{aligned}$$
where
$$\Delta_{U,V}(a, \alpha) = 1 - \frac{(1 - |a|^\alpha) \int \|x - y\|^\alpha\, d\Pi^*(x, y)}{W_{c,\varepsilon}(U, V)} \in (0, 1)$$
for $\varepsilon \in (0, +\infty)$ and $|a| < 1$, due to the fact that $0 < (1 - |a|^\alpha) \int \|x - y\|^\alpha\, d\Pi^*(x, y) < \int \|x - y\|^\alpha\, d\Pi^*(x, y) < W_{c,\varepsilon}(U, V)$. Thus $\Delta_{U,V}(a, \alpha)$ is a function less than 1 that depends on the two marginal distributions and the scale factor $a$. This result implies a new variant of the scale sensitivity property of the Sinkhorn divergence with a non-constant factor $\Delta_{U,V}(a, \alpha) < 1$ when we choose $c = -k_\alpha$ and $|a| < 1$.

3.3. A New Contraction Mapping Theorem.

We derive a new contraction mapping theorem based on a distribution distance $d$ in order to prove the convergence in 3.4.

Theorem 2. (Distribution Contraction Mapping Theorem with a Non-constant Factor) Consider a distribution distance $d$ and a mapping $g: \mathcal{P} \to \mathcal{P}$. Suppose $g$ is a contraction: there exists a function $q(X, Y) < 1$ such that for all distributions $X$ and $Y$,
$$d(g(X), g(Y)) \leq q(X, Y)\, d(X, Y).$$
Then there exists a unique distribution $X^*$ with $g(X^*) = X^*$.

Proof. We consider the convergence of the distribution sequence $\{X_k\}$. The update rule gives
$$d(X_{k+1}, X_k) = d(g(X_k), g(X_{k-1})) \leq q_{k,k-1}\, d(X_k, X_{k-1}),$$
where we write $q_{k,k-1} = q(X_k, X_{k-1})$ for short. Hence,
$$d(X_{k+1}, X_k) \leq \prod_{i=1}^{k} q_{i,i-1}\, d(X_1, X_0).$$
Let $d_0 = d(X_1, X_0)$. From the triangle inequality, we have
$$\begin{aligned}
d(X_{k+l}, X_k) &\leq d(X_{k+1}, X_k) + \cdots + d(X_{k+l}, X_{k+l-1}) \\
&\leq \prod_{i=1}^{k} q_{i,i-1}\, d_0 + \cdots + \prod_{i=1}^{k+l-1} q_{i,i-1}\, d_0 \\
&\leq \prod_{i=1}^{k} q_{i,i-1} \left(1 + q_{k+1,k} + \cdots + \prod_{i=k+1}^{k+l-1} q_{i,i-1}\right) d_0 \\
&\leq \prod_{i=1}^{k} q_{i,i-1} \left(1 + q_{k+1,k} + \cdots + \prod_{i=k+1}^{k+l-1} q_{i,i-1} + \cdots\right) d_0.
\end{aligned}$$
For the infinite series $1 + q_{k+1,k} + \cdots + \prod_{i=k+1}^{k+l-1} q_{i,i-1} + \cdots$, denote its $i$-th term by $u_i$. By the ratio test for infinite series, $\lim_{i \to \infty} \frac{u_{i+1}}{u_i} < 1$, so the series converges. Since $\prod_{i=1}^{k} q_{i,i-1} \to 0$ as $k \to \infty$, we have $d(X_{k+l}, X_k) \to 0$ as $k \to \infty$, i.e., $\{X_k\}$ is a Cauchy sequence. Therefore, it converges to a limit distribution $X^*$ that satisfies $g(X^*) = X^*$.
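A minimal numeric analogue of Theorem 2 (on real numbers rather than distributions, purely for intuition): the map below has a state-dependent contraction factor, since $|g'(x)| = 0.9|\sin x|$ varies with $x$ yet stays below $0.9 < 1$, and the fixed-point iteration still converges to a unique fixed point:

```python
import math

def g(x):
    # Contraction with a non-constant factor: |g'(x)| = 0.9 * |sin(x)| depends
    # on x but is bounded by 0.9 < 1, mirroring q(X, Y) < 1 in Theorem 2.
    return 0.9 * math.cos(x)

x = 5.0                       # arbitrary starting point
for _ in range(100):
    x_prev, x = x, g(x)
print(x, abs(x - x_prev))     # x approximates the unique fixed point; gap ~ 0
```

After 100 iterations the successive gap $d(X_{k+1}, X_k)$ is numerically zero and $g(x) \approx x$, as the theorem predicts.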

3.4. Contraction of Distributional Bellman Operator under Sinkhorn Divergence.

According to the decomposition of $\overline{W}_{c,\varepsilon}$, it inherits the same properties as $W_{c,\varepsilon}$, i.e., shift invariance and scale sensitivity. Thus, we derive the convergence of the distributional Bellman operator $\mathcal{T}^\pi$ under the supremum form of $\overline{W}_{c,\varepsilon}$, i.e., $\overline{W}^\infty_{c,\varepsilon}$:
$$\begin{aligned}
\overline{W}^\infty_{c,\varepsilon}(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) &= \sup_{s,a} \overline{W}_{c,\varepsilon}(\mathcal{T}^\pi Z_1(s, a), \mathcal{T}^\pi Z_2(s, a)) \\
&= \sup_{s,a} \overline{W}_{c,\varepsilon}(R(s, a) + \gamma Z_1(s', a'), R(s, a) + \gamma Z_2(s', a')) \\
&\overset{c = -k_\alpha}{\leq} \sup_{s,a} \Delta_{Z_1(s', a'), Z_2(s', a')}(\gamma, \alpha)\, \overline{W}_{c,\varepsilon}(Z_1(s', a'), Z_2(s', a')) \\
&\leq \sup_{s', a'} \Delta_{Z_1(s', a'), Z_2(s', a')}(\gamma, \alpha) \cdot \sup_{s', a'} \overline{W}_{c,\varepsilon}(Z_1(s', a'), Z_2(s', a')) \\
&= \Delta_{Z_1, Z_2}(\gamma, \alpha)\, \overline{W}^\infty_{-k_\alpha,\varepsilon}(Z_1, Z_2), \qquad (32)
\end{aligned}$$
where the first inequality comes from the shift invariance and the scale sensitivity property of the Sinkhorn divergence established above, and we let $\Delta_{Z_1, Z_2}(\gamma, \alpha) = \sup_{s', a'} \Delta_{Z_1(s', a'), Z_2(s', a')}(\gamma, \alpha)$. If $\Delta_{Z_1, Z_2}(\gamma, \alpha)$ were merely a constant function of $\gamma$ and $\alpha$, we could directly conclude that the distributional Bellman operator is $\Delta_{Z_1, Z_2}(\gamma, \alpha)$-contractive via the standard Banach fixed point theorem. However, $\Delta_{Z_1, Z_2}(\gamma, \alpha)$ also depends on $Z_1$ and $Z_2$, so we need a new contraction mapping theorem to guarantee the convergence of the fixed-distribution iteration. By Theorem 2 in 3.3, which we established specifically for this contraction proof, $\overline{W}^\infty_{c,\varepsilon}$ guarantees convergence of the distributional Bellman iterations. In summary, we conclude that $\mathcal{T}^\pi$ is a contractive operator when we use $-k_\alpha$ as the cost function and $\gamma < 1$, while the contraction factor, abbreviated $\Delta(\gamma, \alpha) < 1$, is a function not only of $\alpha$ and $\gamma$ but also of the distribution sequence during the iterations.

D PROOF OF PROPOSITION 1 AND COROLLARY 1

Proof. Let $\Pi^*$ denote the optimal $\Pi$ obtained by evaluating the Sinkhorn divergence via $\min_{\Pi \in \Pi(\mu, \nu)} W_{c,\varepsilon}(\mu, \nu; k)$. The Sinkhorn divergence can then be decomposed in the following form:
$$\begin{aligned}
\overline{W}_{c,\varepsilon}(\mu, \nu; k) &= 2\text{KL}(\Pi^*(\mu, \nu) \,\|\, \mathcal{K}_{-k}(\mu, \nu)) - \text{KL}(\Pi^*(\mu, \mu) \,\|\, \mathcal{K}_{-k}(\mu, \mu)) - \text{KL}(\Pi^*(\nu, \nu) \,\|\, \mathcal{K}_{-k}(\nu, \nu)) \\
&= 2\left(\mathbb{E}_{X,Y}[\log \Pi^*(X, Y)] + \tfrac{1}{\varepsilon}\mathbb{E}_{X,Y}[c(X, Y)]\right) - \left(\mathbb{E}_{X,X'}[\log \Pi^*(X, X')] + \tfrac{1}{\varepsilon}\mathbb{E}_{X,X'}[c(X, X')]\right) - \left(\mathbb{E}_{Y,Y'}[\log \Pi^*(Y, Y')] + \tfrac{1}{\varepsilon}\mathbb{E}_{Y,Y'}[c(Y, Y')]\right) \\
&= \mathbb{E}_{X,X',Y,Y'}\left[\log \frac{(\Pi^*(X, Y))^2}{\Pi^*(X, X')\,\Pi^*(Y, Y')}\right] + \tfrac{1}{\varepsilon}\left(\mathbb{E}_{X,X'}[k(X, X')] + \mathbb{E}_{Y,Y'}[k(Y, Y')] - 2\mathbb{E}_{X,Y}[k(X, Y)]\right) \\
&= \mathbb{E}_{X,X',Y,Y'}\left[\log \frac{(\Pi^*(X, Y))^2}{\Pi^*(X, X')\,\Pi^*(Y, Y')}\right] + \tfrac{1}{\varepsilon}\,\text{MMD}^2_{-c}(\mu, \nu),
\end{aligned}$$
where the cost function $c$ in the Gibbs distribution $\mathcal{K}$ is the negative Gaussian kernel, i.e., $c(x, y) = -k(x, y) = -e^{-(x-y)^2/(2\sigma^2)}$. This establishes the result in Corollary 1. Next, we use a Taylor expansion to prove the moment-matching property of MMD. Firstly, we have
$$\text{MMD}^2_{-c}(\mu, \nu) = \mathbb{E}_{X,X'}[k(X, X')] + \mathbb{E}_{Y,Y'}[k(Y, Y')] - 2\mathbb{E}_{X,Y}[k(X, Y)] = \mathbb{E}_{X,X'}[\phi(X)^\top \phi(X')] + \mathbb{E}_{Y,Y'}[\phi(Y)^\top \phi(Y')] - 2\mathbb{E}_{X,Y}[\phi(X)^\top \phi(Y)] = \left\|\mathbb{E}[\phi(X)] - \mathbb{E}[\phi(Y)]\right\|^2.$$
We expand the Gaussian kernel via a Taylor expansion:
$$k(x, y) = e^{-(x-y)^2/(2\sigma^2)} = e^{-\frac{x^2}{2\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} e^{\frac{xy}{\sigma^2}} = e^{-\frac{x^2}{2\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} \sum_{n=0}^{\infty} \frac{1}{\sqrt{n!}}\left(\frac{x}{\sigma}\right)^n \frac{1}{\sqrt{n!}}\left(\frac{y}{\sigma}\right)^n = \sum_{n=0}^{\infty} \left[e^{-\frac{x^2}{2\sigma^2}} \frac{1}{\sqrt{n!}}\left(\frac{x}{\sigma}\right)^n\right]\left[e^{-\frac{y^2}{2\sigma^2}} \frac{1}{\sqrt{n!}}\left(\frac{y}{\sigma}\right)^n\right] = \phi(x)^\top \phi(y).$$
Therefore, we have
$$\text{MMD}^2_{-c}(\mu, \nu) = \sum_{n=0}^{\infty} \frac{1}{\sigma^{2n} n!}\left(\mathbb{E}_{x \sim \mu}\left[e^{-x^2/(2\sigma^2)} x^n\right] - \mathbb{E}_{y \sim \nu}\left[e^{-y^2/(2\sigma^2)} y^n\right]\right)^2 = \sum_{n=0}^{\infty} \frac{1}{\sigma^{2n} n!}\left(\widetilde{M}_n(\mu) - \widetilde{M}_n(\nu)\right)^2,$$
where $\widetilde{M}_n(\mu) = \mathbb{E}_{x \sim \mu}[e^{-x^2/(2\sigma^2)} x^n]$, and similarly for $\widetilde{M}_n(\nu)$. This conclusion matches the moment matching in (Nguyen et al., 2020). Finally, by the equivalence of $\overline{W}_{c,\varepsilon}(\mu, \nu)$ after multiplying by $\varepsilon$, we have
$$\overline{W}_{c,\varepsilon}(\mu, \nu; k) := \text{MMD}^2_{-c}(\mu, \nu) + \varepsilon\, \mathbb{E}\left[\log \frac{(\Pi^*(X, Y))^2}{\Pi^*(X, X')\,\Pi^*(Y, Y')}\right] = \sum_{n=0}^{\infty} \frac{1}{\sigma^{2n} n!}\left(\widetilde{M}_n(\mu) - \widetilde{M}_n(\nu)\right)^2 + \varepsilon\, \mathbb{E}\left[\log \frac{(\Pi^*(X, Y))^2}{\Pi^*(X, X')\,\Pi^*(Y, Y')}\right].$$
This result is also consistent with Theorem 1, where $\Pi^*$ degenerates to $\mu \otimes \nu$ as $\varepsilon \to +\infty$.
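The Taylor-expansion identity $k(x, y) = \phi(x)^\top \phi(y)$ can be verified numerically by truncating the feature map; the sketch below (truncation at 30 terms is an arbitrary illustrative choice) confirms the expansion for the Gaussian kernel:

```python
import math
import numpy as np

def phi(x, sigma, n_terms=30):
    """Truncated feature map phi_n(x) = exp(-x^2/(2 sigma^2)) (x/sigma)^n / sqrt(n!)."""
    n = np.arange(n_terms)
    fact = np.array([math.factorial(int(k)) for k in n], dtype=float)
    return np.exp(-x ** 2 / (2 * sigma ** 2)) * (x / sigma) ** n / np.sqrt(fact)

sigma, x, y = 1.0, 0.7, -0.4
exact = math.exp(-(x - y) ** 2 / (2 * sigma ** 2))   # Gaussian kernel k(x, y)
approx = float(phi(x, sigma) @ phi(y, sigma))        # inner product phi(x)^T phi(y)
print(exact, approx)
```

The truncated inner product matches the exact kernel value to near machine precision, since the series $e^{xy/\sigma^2} = \sum_n (xy/\sigma^2)^n / n!$ converges very fast for moderate $|xy|/\sigma^2$.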
In that case, the first regularization term vanishes, and thus the Sinkhorn divergence degrades to an MMD loss, i.e., $\text{MMD}^2_{-c}(\mu, \nu)$.

The human-normalized score is computed as (algorithm − random play) / (human − random play). Our implementation of DQN, QR-DQN-1, C51, MMDDRL, and SinkhornDRL is based on (Zhang, 2018), and all experimental settings, including hyperparameters, are identical to the distributional RL baselines implemented in (Zhang, 2018). The mean and median human-normalized scores of all considered distributional RL algorithms are reported in Table 2. We also compare the performance of QRDQN(tf) and MMDDRL(tf) after the same 10M time steps based on the TensorFlow implementation in the Dopamine framework (Castro et al., 2018); these results are averaged over the training data released in (Nguyen-Tang, 2021). We argue that human-normalized scores may be of limited use for evaluating the superiority of algorithms, as the mean can be heavily affected by performance on games with high-magnitude returns. For example, in Figure 2, MMDDRL is superior to QR-DQN, and Sinkhorn outperforms MMDDRL on a smaller portion of games than it does QR-DQN, yet the mean score for MMDDRL in Table 2 is lower than that of QR-DQN. By contrast, SinkhornDRL is superior in terms of mean score and competitive in terms of median. We also provide all average results in Table 3 of Appendix H and all learning curves of our implemented algorithms in Appendix F. From Figure 10(a), we can observe that if we gradually decrease $\varepsilon$ to 0, SinkhornDRL's performance tends to degrade and approach that of QR-DQN. Note that an overly small $\varepsilon$ leads to $K_{i,j}$ values that are almost 0 in the Sinkhorn iteration of Algorithm 2, causing a division-by-zero numerical instability for $a_l$ and $b_l$ in Line 5 of Algorithm 2. In addition, we also conducted experiments on Seaquest, where a similar result is observed in Figure 10(d).
As shown in Figure 10(d), the performance of SinkhornDRL is robust for $\varepsilon = 10, 100, 500$, but a small $\varepsilon = 1$ tends to worsen the performance.

E HUMAN-NORMALIZED SCORES

Increasing ε. Moreover, on Breakout, if we increase $\varepsilon$, the performance of SinkhornDRL tends to degrade and approach that of MMDDRL, as suggested in Figure 10(b). It is also noted that an overly large $\varepsilon$ lets $K_{i,j}$ explode to $\infty$, which likewise causes numerical instability in the Sinkhorn iteration of Algorithm 2. Samples N. We find that SinkhornDRL requires a proper number of samples $N$ to perform favorably, and the sensitivity with respect to $N$ depends on the environment. As suggested in Figure 11(a), a small $N$, e.g., $N = 2$, already achieves favorable performance on Breakout and even accelerates convergence in the early phase, whereas $N = 2$ on Seaquest leads to a divergence issue. Meanwhile, an overly large $N$ worsens the performance on both games. We conjecture that using larger networks to generate more samples may suffer from overfitting, yielding training instability (Bjorck et al., 2021). In practice, we choose a proper number of samples, i.e., $N = 200$, across all games.
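The numerical instability at small $\varepsilon$ discussed above is commonly mitigated by running the Sinkhorn iteration in the log domain, where $K_{i,j} = e^{-c_{i,j}/\varepsilon}$ is never formed explicitly. This is not the implementation used in the paper, only a standard stabilization sketch on toy discrete measures:

```python
import numpy as np

def logsumexp(v, axis):
    # Numerically stable log-sum-exp along one axis.
    m = v.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(v - m).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn_log(mu, nu, C, eps, n_iters=300):
    """Log-domain Sinkhorn: updates dual potentials f, g directly, so that
    K_ij = exp(-C_ij / eps), which underflows to 0 for small eps, never appears."""
    f, g = np.zeros_like(mu), np.zeros_like(nu)
    for _ in range(n_iters):
        f = eps * (np.log(mu) - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (np.log(nu) - logsumexp((f[:, None] - C) / eps, axis=0))
    P = np.exp((f[:, None] + g[None, :] - C) / eps)  # recovered transport plan
    return float((P * C).sum())                      # entropic OT cost <P, C>

grid = np.linspace(0.0, 1.0, 8)
mu = nu = np.ones(8) / 8
C = (grid[:, None] - grid[None, :]) ** 2

cost = sinkhorn_log(mu, nu, C, eps=1e-3)  # naive exp(-C/eps) would underflow here
print(cost)  # near 0: identical marginals, nearly diagonal plan
```

For identical marginals the recovered plan is nearly diagonal and the transport cost stays finite and near zero even at $\varepsilon = 10^{-3}$, where the plain iteration would divide by underflowed entries.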

G.2 COMPARISON WITH THE COMPUTATIONAL COST

We evaluate the computational time every 10,000 iterations across the whole training process for all considered distributional RL algorithms and compare them in Figure 12. The comparison suggests that SinkhornDRL indeed increases the computation cost by around 50% compared with QR-DQN and C51, but only slightly increases the cost relative to MMDDRL on both Breakout and Qbert. We argue that this additional computational burden is tolerable in view of the significant outperformance of SinkhornDRL in a large number of environments. In addition, we also find that the number of Sinkhorn iterations $L$ has a negligible effect on the computation cost, whereas an overly large number of samples $N$, e.g., 500, leads to a large computational burden, as illustrated in Figure 13. This can be intuitively explained by the fact that computing the cost matrix $c_{i,j}$ is $O(N^2)$ in SinkhornDRL, which becomes particularly heavy when $N$ is large.
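The $O(N^2)$ cost-matrix computation mentioned above can be sketched as follows (illustrative sizes only; the quadratic growth in the number of entries explains the burden at large $N$):

```python
import time
import numpy as np

def cost_matrix(x, y, alpha=2):
    # Pairwise cost c_ij = |x_i - y_j|^alpha: an N x N matrix, hence O(N^2)
    # time and memory, which dominates the per-update cost once N grows.
    return np.abs(x[:, None] - y[None, :]) ** alpha

rng = np.random.default_rng(0)
for n in (100, 200, 400):
    x, y = rng.normal(size=n), rng.normal(size=n)
    t0 = time.perf_counter()
    C = cost_matrix(x, y)
    print(n, C.shape, f"{time.perf_counter() - t0:.1e}s")  # entries: 1e4, 4e4, 1.6e5
```

Doubling $N$ quadruples the number of matrix entries, consistent with the observed sensitivity of wall-clock cost to $N$ but not to $L$.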



Distributional Bellman operators. For policy evaluation in expectation-based RL, the action-value function is updated via the Bellman operator $\mathcal{T}^\pi Q(s, a) = \mathbb{E}[R(s, a)] + \gamma \mathbb{E}_{s' \sim p,\, a' \sim \pi}[Q(s', a')]$. In distributional RL, the distribution of $Z^\pi(s, a)$ is updated via the distributional Bellman operator $\mathcal{T}^\pi$:
$$\mathcal{T}^\pi Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(s', a'),$$
where $s' \sim p(\cdot \mid s, a)$ and $a' \sim \pi(\cdot \mid s')$.
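In sample-based distributional RL, the distributional Bellman operator acts on the $N$ samples representing $Z(s', a')$ by a simple affine transformation; a minimal sketch with hypothetical sample values:

```python
import numpy as np

def bellman_target_samples(reward, gamma, next_samples):
    """Sample-based distributional Bellman target: each sample z_i of Z(s', a')
    maps to r + gamma * z_i, realizing T^pi Z(s, a) =_D R(s, a) + gamma Z(s', a')."""
    return reward + gamma * next_samples

# Hypothetical N = 5 samples representing the next-state return distribution.
next_samples = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
target = bellman_target_samples(reward=1.0, gamma=0.99, next_samples=next_samples)
print(target)  # [0.01 1.   1.495 1.99  2.98]
```

The resulting target samples then play the role of the target Bellman distribution against which the Sinkhorn divergence is evaluated.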

Figure 1: Learning curves of SinkhornDRL algorithm compared with DQN, C51, QR-DQN and MMD, on nine typical Atari games over 3 seeds.

Figure 3: Sensitivity analysis of SinkhornDRL on Breakout regarding ε, the number of samples, and the number of iterations L. Learning curves are reported over 3 seeds.

Distributional RL. MMD distributional RL (MMDDRL) (Nguyen et al., 2020) learns samples to represent the value distribution $Z_\theta$ based on the maximum mean discrepancy (MMD) in Eq. 20, achieving state-of-the-art performance on Atari games.

Contraction under the supremum form of the Wasserstein distance is provided in Lemma 3 of (Bellemare et al., 2017a). • Contraction under the supremum form of the $\ell_p$ distance follows from Theorem 3.4 of (Elie & Arthur, 2020). • Contraction under $\text{MMD}_\infty$ is provided in Lemma 6 of (Nguyen et al., 2020).

Figure 8: Performance of SinkhornDRL compared with DQN, C51, QRDQN and MMD on Bowling, Boxing, DoubleDunk, Freeway, Gravitar, Kangaroo, Krull, KunFuMaster and MontezumaRevenge.

Figure 11: Sensitivity analysis of Sinkhorn in terms of the number of samples N on Breakout (a) and Seaquest (b).

Figure 12: Average computational cost per 10,000 iterations of all considered distributional RL algorithms, where we select ε = 10, L = 10, and number of samples N = 200 in the SinkhornDRL algorithm.

Generic Sinkhorn distributional RL update. Require: the number of generated samples N, the cost function c, and the hyperparameter ε.
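A minimal sketch of the update's core computation, assuming both sample sets have the same size N, a quadratic cost, and plain (non-stabilized) Sinkhorn iterations; function names are illustrative, not the paper's implementation:

```python
import numpy as np

def entropic_ot(x, y, eps, n_iters=100):
    """Plain Sinkhorn iterations between two equal-size empirical sample sets."""
    n = len(x)
    mu = nu = np.ones(n) / n
    C = (x[:, None] - y[None, :]) ** 2          # cost matrix c_ij = (x_i - y_j)^2
    K = np.exp(-C / eps)                        # Gibbs kernel K_ij
    a = np.ones(n)
    for _ in range(n_iters):                    # alternating scaling updates
        b = nu / (K.T @ a)
        a = mu / (K @ b)
    P = a[:, None] * K * b[None, :]
    return float((P * C).sum())                 # entropic OT cost <P, C>

def sinkhorn_divergence(x, y, eps):
    # Debiased loss: 2 W_eps(mu, nu) - W_eps(mu, mu) - W_eps(nu, nu).
    return 2 * entropic_ot(x, y, eps) - entropic_ot(x, x, eps) - entropic_ot(y, y, eps)

rng = np.random.default_rng(0)
cur = rng.normal(0.0, 1.0, 64)   # samples of the current return distribution
tgt = rng.normal(1.0, 1.0, 64)   # samples of the target Bellman distribution
print(sinkhorn_divergence(cur, tgt, eps=1.0))  # positive: the two sets differ
print(sinkhorn_divergence(cur, cur, eps=1.0))  # exactly 0 by construction
```

In the actual algorithm this scalar would serve as the training loss between the generated samples of $Z_\theta(s, a)$ and the target Bellman samples, with gradients flowing through the sample locations.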

Mean and median of best human-normalized scores across 55 Atari 2600 games. The results for all considered algorithms are averaged over 3 seeds. DQN, C51, QR-DQN-1, and MMDDRL are based on our PyTorch implementation after 10M time steps, adapted from (Zhang, 2018), while QRDQN(tf) and MMDDRL(tf) are training results, also after 10M time steps, from the TensorFlow Dopamine framework of MMDDRL (Nguyen et al., 2020) released in (Nguyen-Tang, 2021).

Scores of all algorithms averaged over 3 seeds across 55 Atari games.


Ethics Statement. Our study concerns the design of distributional RL algorithms and does not involve any ethical issues. Reproducibility Statement. Our results are based on the public implementation released in (Zhang, 2018), with the necessary implementation details given in Appendix F. We also provide detailed proofs in Appendix C and Appendix D.

F MORE EXPERIMENTAL RESULTS

We provide learning curves of DQN, QRDQN, C51, MMDDRL, and SinkhornDRL on all 55 Atari games in Figures 4, 5, 6, 7, 8, and 9. They illustrate that SinkhornDRL dramatically surpasses the other distributional RL algorithms on a large number of environments, e.g., Venture, Atlantis, Tennis, and SpaceInvaders, and presents competitive performance, or is only slightly inferior, relative to the state-of-the-art baselines on the remaining games. Note that the average improvement of SinkhornDRL on Venture is significant because SinkhornDRL converges on one to two of the 3 seeds, while the other baselines do not converge on any of the considered seeds. Although this improvement may also be subject to the instability issue, the occasional success of our SinkhornDRL algorithm indicates substantial potential on some complicated environments. We leave further exploration of the advantages and potential of the SinkhornDRL algorithm as future work.

G SENSITIVITY ANALYSIS AND COMPUTATIONAL COST G.1 MORE RESULTS IN SENSITIVITY ANALYSIS

Decreasing ε. We argue that the limiting behavior stated in Theorem 1 may be difficult to verify rigorously via numerical experiments due to the numerical instability of the Sinkhorn iteration in Algorithm 2.

