DISTRIBUTIONAL REINFORCEMENT LEARNING VIA SINKHORN ITERATIONS

Abstract

Distributional reinforcement learning (RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than only its expectation. The empirical success of distributional RL hinges on the representation of return distributions and the choice of distribution divergence. In this paper, we propose a new class of Sinkhorn distributional RL (SinkhornDRL) algorithms that learn a finite set of statistics, i.e., deterministic samples, from each return distribution and then use Sinkhorn iterations to evaluate the Sinkhorn divergence between the current and target Bellman distributions. The Sinkhorn divergence interpolates between the Wasserstein distance and Maximum Mean Discrepancy (MMD). SinkhornDRL thus finds a sweet spot, exploiting both the geometry of optimal-transport-based distances and the unbiased gradient estimates of MMD. Finally, we demonstrate the competitive performance of SinkhornDRL against state-of-the-art algorithms on the suite of 55 Atari games.

1. INTRODUCTION

Classical reinforcement learning (RL) algorithms are typically based on the expectation of the discounted cumulative rewards that an agent observes while interacting with the environment. Recently, a new class of RL algorithms called distributional RL estimates the full distribution of total returns and has exhibited state-of-the-art performance in a wide range of environments (Bellemare et al., 2017a; Dabney et al., 2018b;a; Yang et al., 2019; Zhou et al., 2020; Nguyen et al., 2020). In the distributional RL literature, algorithms based on either the Wasserstein distance or MMD have attracted great attention owing to their superior performance. The mathematical connection between these two metrics motivates us to explore them further in order to design new algorithms. In particular, the Wasserstein distance, long known as a powerful tool for comparing probability distributions with non-overlapping supports, has recently emerged as an appealing contender in various machine learning applications. The Wasserstein distance was long disregarded because, in its original form, computing it requires solving an expensive network flow problem. However, recent works (Sinkhorn, 1967; Genevay et al., 2018) have shown that this cost can be largely mitigated by settling for cheaper approximations obtained through strongly convex regularizers. This regularization has opened the path to wider applications of the Wasserstein distance in learning problems, including the design of distributional RL algorithms. The Sinkhorn divergence (Sinkhorn, 1967) adds an entropic regularization term to the Wasserstein distance, making its evaluation tractable, especially in high dimensions. It has been successfully applied in numerous machine learning developments, including Sinkhorn-GANs (Genevay et al., 2018) and Sinkhorn-based adversarial training (Wong et al., 2019).
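To make the entropic regularization concrete, the following is a minimal sketch (not the paper's implementation) of Sinkhorn iterations computing the entropy-regularized transport cost between two equally weighted particle sets, as one might use to compare discrete return distributions. The function name, hyperparameters, and toy particle sets are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.05, n_iters=500):
    """Entropy-regularized optimal transport cost between two
    equally weighted 1-D particle sets, via Sinkhorn iterations."""
    n, m = len(x), len(y)
    a = np.full(n, 1.0 / n)                 # uniform source weights
    b = np.full(m, 1.0 / m)                 # uniform target weights
    C = (x[:, None] - y[None, :]) ** 2      # squared-distance cost matrix
    K = np.exp(-C / eps)                    # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):                # alternate marginal scalings
        v = b / (K.T @ u)                   # match column marginals to b
        u = a / (K @ v)                     # match row marginals to a
    P = u[:, None] * K * v[None, :]         # entropic transport plan
    return float(np.sum(P * C))             # transport cost <P, C>

# Toy "return distributions" represented by deterministic samples.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5, 2.5])
cost = sinkhorn_cost(x, y, eps=0.05)
# For small eps the cost approaches the unregularized optimal-transport
# cost of the sorted matching, here (3 * 0.5**2) / 3 = 0.25.
```

As eps shrinks, the plan concentrates on the unregularized optimal matching; as eps grows, it blurs toward the independent coupling. In practice, a log-domain implementation is preferred for very small eps to avoid numerical underflow in the Gibbs kernel.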
More importantly, it has been shown that the Sinkhorn divergence interpolates between the Wasserstein distance and MMD, and equivalences can be established in the limiting cases (Feydy et al., 2019; Ramdas et al., 2017; Nguyen et al., 2020). However, a Sinkhorn-based distributional RL algorithm has not yet been formally proposed, and its connection with algorithms based on the Wasserstein distance and MMD remains understudied. A natural question is therefore: can we design a new class of distributional RL algorithms via the Sinkhorn divergence, thus bridging the gap between the two existing main branches of distributional RL algorithms? Moreover, the dominant quantile-regression-based algorithms, e.g., QR-DQN (Dabney et al., 2018b), which aim at approximating the Wasserstein distance, suffer from

