DISTRIBUTIONAL REINFORCEMENT LEARNING VIA SINKHORN ITERATIONS

Abstract

Distributional reinforcement learning (RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than only its expectation. The empirical success of distributional RL hinges on the representation of return distributions and the choice of distribution divergence. In this paper, we propose a new class of Sinkhorn distributional RL (SinkhornDRL) algorithms that learn a finite set of statistics, i.e., deterministic samples, from each return distribution and then use Sinkhorn iterations to evaluate the Sinkhorn distance between the current and target Bellman distributions. The Sinkhorn divergence interpolates between the Wasserstein distance and Maximum Mean Discrepancy (MMD); SinkhornDRL thus finds a sweet spot by exploiting both the geometry of optimal-transport-based distances and the unbiased gradient estimates of MMD. Finally, we demonstrate the competitive performance of SinkhornDRL against state-of-the-art algorithms on a suite of 55 Atari games.

1. INTRODUCTION

Classical reinforcement learning (RL) algorithms are normally based on the expectation of discounted cumulative rewards that an agent observes while interacting with the environment. Recently, a new class of RL algorithms, distributional RL, estimates the full distribution of total returns and has exhibited state-of-the-art performance in a wide range of environments (Bellemare et al., 2017a; Dabney et al., 2018b;a; Yang et al., 2019; Zhou et al., 2020; Nguyen et al., 2020). In the distributional RL literature, algorithms based on either the Wasserstein distance or MMD have gained great attention due to their superior performance, and their mutual connection in terms of mathematical properties motivates us to explore further in order to design new algorithms. In particular, the Wasserstein distance, long known as a powerful tool for comparing probability distributions with non-overlapping supports, has recently emerged as an appealing contender in various machine learning applications. The Wasserstein distance was long disregarded because computing it in its original form requires solving an expensive network flow problem. Recent works (Sinkhorn, 1967; Genevay et al., 2018), however, have shown that this cost can be largely mitigated by settling for cheaper approximations obtained through strongly convex regularizers. This regularization has opened the path to wider applications of the Wasserstein distance in learning problems, including the design of distributional RL algorithms. The Sinkhorn divergence (Sinkhorn, 1967) adds an entropic regularization term to the Wasserstein distance, making its evaluation tractable, especially in high dimensions. It has been successfully applied in numerous crucial machine learning developments, including Sinkhorn-GANs (Genevay et al., 2018) and Sinkhorn-based adversarial training (Wong et al., 2019).
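To make the entropic regularization concrete, the following is a minimal NumPy sketch of the Sinkhorn fixed-point iterations for the entropic OT cost between two uniform empirical measures. The function name, sample sizes, and hyperparameters are illustrative and do not reproduce the paper's implementation.

```python
import numpy as np

def sinkhorn_distance(x, y, epsilon=0.5, n_iters=200):
    """Transport-cost part of entropic-regularized OT between two 1-D
    sample sets with uniform weights, via Sinkhorn fixed-point iterations."""
    C = (x[:, None] - y[None, :]) ** 2        # squared-Euclidean cost matrix
    K = np.exp(-C / epsilon)                  # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))         # uniform source marginal
    b = np.full(len(y), 1.0 / len(y))         # uniform target marginal
    v = np.ones(len(y))
    for _ in range(n_iters):                  # alternating marginal scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]           # approximate transport plan
    return float(np.sum(P * C))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 50)
y = rng.normal(2.0, 1.0, 50)
print(sinkhorn_distance(x, y))
```

Each iteration only involves matrix-vector products with the Gibbs kernel, which is what makes the entropic relaxation far cheaper than solving the exact network flow problem.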
More importantly, the Sinkhorn divergence has been shown to interpolate between the Wasserstein distance and MMD, with equivalences established in the limiting cases (Feydy et al., 2019; Ramdas et al., 2017; Nguyen et al., 2020). However, a Sinkhorn-based distributional RL algorithm has not yet been formally proposed, and its connection with algorithms based on the Wasserstein distance and MMD remains understudied. A natural question is therefore: can we design a new class of distributional RL algorithms via the Sinkhorn divergence, thereby bridging the gap between the two main existing branches of distributional RL algorithms? Moreover, the dominant quantile-regression-based algorithms, e.g., QR-DQN (Dabney et al., 2018b), which aim to approximate the Wasserstein distance, suffer from the non-crossing issue in quantile estimation (Zhou et al., 2020), whereas a sample-based Sinkhorn distributional algorithm naturally circumvents this problem. In this paper, we propose a novel family of distributional RL algorithms based on the Sinkhorn divergence. We first show the key roles of the distribution divergence and the value distribution representation in the design of distributional RL algorithms. After a detailed introduction of the proposed SinkhornDRL algorithm, we theoretically analyze, via a non-trivial proof, the convergence of the distributional Bellman operator under the Sinkhorn divergence. We also establish an equivalence between the Sinkhorn divergence and a regularized MMD, which helps explain its empirical success in real applications. Finally, we compare SinkhornDRL with typical baselines on 55 Atari games, verifying its competitive performance. Our method suggests a trade-off that simultaneously leverages the geometry of the Wasserstein distance and the favorable unbiased gradient estimates of MMD when designing future distributional RL algorithms.
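The interpolation property can be illustrated numerically in 1-D: with a small regularizer the entropic transport cost approaches the squared 2-Wasserstein distance, while with a large regularizer the optimal plan tends to the product measure, so the cost tends to the mean pairwise cost (the regime in which Sinkhorn behaves like an MMD-type loss). The following self-contained NumPy sketch uses illustrative names and constants only.

```python
import numpy as np

def sinkhorn_cost(x, y, epsilon, n_iters=500):
    """Transport-cost part of entropic OT between uniform empirical measures."""
    C = (x[:, None] - y[None, :]) ** 2
    K = np.exp(-C / epsilon)
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    v = np.ones(len(y))
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return float(np.sum(P * C))

rng = np.random.default_rng(1)
x = np.sort(rng.normal(0.0, 1.0, 40))
y = np.sort(rng.normal(1.0, 1.0, 40))

# epsilon -> 0: cost approaches the exact squared 2-Wasserstein distance,
# which for equal-size 1-D samples is the mean squared gap of sorted samples.
w2_sq = float(np.mean((x - y) ** 2))
# epsilon -> infinity: the plan tends to the product measure, so the cost
# tends to the mean pairwise cost E[c(X, Y)].
product_cost = float(np.mean((x[:, None] - y[None, :]) ** 2))
print(sinkhorn_cost(x, y, 0.05), w2_sq)          # close for small epsilon
print(sinkhorn_cost(x, y, 100.0), product_cost)  # close for large epsilon
```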

2.1. DISTRIBUTIONAL REINFORCEMENT LEARNING

In classical RL, an agent interacts with an environment via a Markov decision process (MDP), a 5-tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, $P$ is the environment transition dynamics, $R$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor.

From value function to value distribution. Given a policy $\pi$, the discounted sum of future rewards is a random variable $Z^\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$, where $s_0 = s$, $a_0 = a$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and $a_t \sim \pi(\cdot \mid s_t)$. In the control setting, expectation-based RL is based on the action-value function $Q^\pi(s, a)$, the expectation of $Z^\pi(s, a)$, i.e., $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]$. By contrast, distributional RL focuses on the action-value distribution, the full distribution of $Z^\pi(s, a)$. The incorporation of this additional distributional knowledge intuitively explains its empirical success.
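As a toy illustration of the distinction, the sketch below draws Monte Carlo samples of the random return $Z = \sum_t \gamma^t R_t$ in a hypothetical one-state MDP; expectation-based RL would keep only the sample mean, whereas distributional RL models the whole sample set. All quantities here are illustrative and not part of the paper.

```python
import random

def sample_return(gamma=0.9, p_terminate=0.1, rng=random.Random(0)):
    """One Monte Carlo rollout in a hypothetical one-state MDP where each
    step yields reward +1 or -1 with equal probability and the episode
    terminates with probability p_terminate."""
    g, discount = 0.0, 1.0
    while True:
        g += discount * rng.choice([1.0, -1.0])
        discount *= gamma
        if rng.random() < p_terminate:
            return g

returns = [sample_return() for _ in range(10000)]
q_value = sum(returns) / len(returns)  # expectation-based RL keeps only this scalar
# distributional RL models the entire empirical distribution of `returns`
```

Even though the mean return here is close to zero, the empirical return distribution is wide and multimodal, which is exactly the information distributional RL retains and expectation-based RL discards.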

Distributional Bellman operator. In distributional RL, the distribution of $Z^\pi(s, a)$ is updated via the distributional Bellman operator $\mathcal{T}^\pi$:

$\mathcal{T}^\pi Z(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(s', a')$,

where $s' \sim P(\cdot \mid s, a)$ and $a' \sim \pi(\cdot \mid s')$. The equality indicates that the random variables on both sides are equal in distribution. The distributional Bellman operator $\mathcal{T}^\pi$ is a contraction under certain distribution divergence metrics. We provide a detailed discussion of further related works in Appendix A.
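For a particle (deterministic-sample) representation of the return distribution, applying the distributional Bellman operator amounts to shifting and scaling each sample. A minimal sketch with illustrative numbers:

```python
import numpy as np

gamma = 0.99

# Hypothetical particles representing Z(s', a') at the successor state-action,
# e.g. the deterministic samples a SinkhornDRL-style network would output.
next_return_samples = np.array([1.0, 2.5, 3.0, 4.2])
reward = 0.5

# Applying the distributional Bellman operator to the particle representation:
# each sample of Z(s', a') becomes a sample of T^pi Z(s, a) = r + gamma * Z(s', a').
target_samples = reward + gamma * next_return_samples
```

A distributional RL loss then measures a divergence (e.g. the Sinkhorn distance) between the current particles for $Z(s, a)$ and these target particles.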

2.2. DIVERGENCES BETWEEN MEASURES

Optimal Transport (OT) and Wasserstein Distance. The optimal transport (OT) metric between two probability measures $(\mu, \nu)$ is defined as the solution of the linear program $\min_{\Pi \in \Pi(\mu, \nu)} \int c(x, y) \, d\Pi(x, y)$, where $c$ is the cost function and $\Pi(\mu, \nu)$ is the set of joint distributions with marginals $(\mu, \nu)$. The Wasserstein distance (a.k.a. earth mover's distance) is the special case of optimal transport with the Euclidean norm as the cost function. In particular, for two scalar random variables $X$ and $Y$, the $p$-Wasserstein metric $W_p$ between their distributions simplifies to $W_p(X, Y) = \left( \int_0^1 \left| F_X^{-1}(\omega) - F_Y^{-1}(\omega) \right|^p d\omega \right)^{1/p}$, where $F^{-1}$ is the inverse cumulative distribution function of a random variable. The desirable geometric properties of the Wasserstein distance allow it to recover the full support of measures, but it suffers from the curse of dimensionality (Genevay et al., 2019; Arjovsky et al., 2017).

Maximum Mean Discrepancy. The squared Maximum Mean Discrepancy (MMD) with kernel $k$ is formulated as $\mathrm{MMD}_k^2(X, Y) = \mathbb{E}[k(X, X')] + \mathbb{E}[k(Y, Y')] - 2\mathbb{E}[k(X, Y)]$, where $X'$ and $Y'$ are independent copies of $X$ and $Y$, respectively.
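Both divergences are straightforward to estimate from 1-D samples. Below is a small NumPy sketch using the inverse-CDF (sorted-samples) form of $W_p$ for equal-size samples and the standard plug-in estimator of squared MMD with a Gaussian kernel; the bandwidth and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)
y = rng.normal(1.0, 1.0, 1000)

def wasserstein_p(x, y, p=1):
    """1-D p-Wasserstein distance: for equal-size samples the empirical
    quantile functions are just the sorted samples."""
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

def mmd_sq(x, y, bandwidth=1.0):
    """Plug-in (biased V-statistic) estimator of squared MMD with a Gaussian kernel."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())

print(wasserstein_p(x, y, p=1), mmd_sq(x, y))
```

For these two unit-variance Gaussians shifted by one, the 1-Wasserstein estimate is close to the mean shift, while the MMD estimate depends on the chosen kernel bandwidth.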



Bellman operators. For policy evaluation in expectation-based RL, the action-value function is updated via the Bellman operator $\mathcal{T}^\pi Q(s, a) = \mathbb{E}[R(s, a)] + \gamma \mathbb{E}_{s' \sim P, a' \sim \pi}[Q(s', a')]$. In distributional RL, the distribution of $Z^\pi(s, a)$ is updated via the distributional Bellman operator $\mathcal{T}^\pi$: $\mathcal{T}^\pi Z(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(s', a')$.

