A DISTRIBUTIONAL PERSPECTIVE ON ACTOR-CRITIC FRAMEWORK

Abstract

Recent distributional reinforcement learning methods, despite their successes, still contain fundamental problems that can lead to inaccurate representations of value distributions, such as distributional instability, restriction to a single action type, and conflation between samples and statistics. In this paper, we present a novel distributional actor-critic framework, GMAC, to address these problems. Adopting a stochastic policy removes the first two problems, and the conflation in the approximation is alleviated by minimizing the Cramér distance between the value distribution and its Bellman target distribution. In addition, GMAC improves data efficiency by generating the Bellman target distribution through the Sample-Replacement algorithm, denoted by SR(λ), which provides a distributional generalization of multi-step policy evaluation algorithms. We empirically show that our method captures the multimodality of value distributions and improves the performance of a conventional actor-critic method at low computational cost in both discrete and continuous action spaces, using the Arcade Learning Environment (ALE) and PyBullet environments.

1. INTRODUCTION

The ability to learn complex representations via neural networks has enabled success in various applications of reinforcement learning (RL), such as pixel-based video games (Mnih et al., 2015), the game of Go (Silver et al., 2016), robotics (Levine et al., 2016), and high-dimensional control tasks such as humanoid locomotion (Lillicrap et al., 2016; Schulman et al., 2015). Starting from the seminal work on the Deep Q-Network (DQN) (Mnih et al., 2015), advances in value prediction networks, in particular, have been one of the main driving forces behind this breakthrough. Among the milestones in value function approximation, distributional reinforcement learning (DRL) extends the scalar value function to a distributional representation. The distributional perspective offers various benefits by providing more information on the characteristics and behavior of the value. One such benefit is the preservation of multimodality in value distributions, which leads to more stable learning of the value function (Bellemare et al., 2017a).

Despite this development, several issues remain that hinder DRL from becoming a robust framework. First, a theoretical instability exists in the control setting of value-based DRL methods (Bellemare et al., 2017a). Second, previous DRL algorithms are limited to a single type of action space, either discrete (Bellemare et al., 2017a; Dabney et al., 2018b;a) or continuous (Barth-Maron et al., 2018; Singh et al., 2020). Third, a common choice of loss function is the Huber quantile regression loss, which is vulnerable to conflation between samples and statistics in the absence of an imputation strategy (Rowland et al., 2019). The instability issue is not present once a trainable policy is introduced, i.e., when the Bellman operator is used in the evaluation setting, as shown by the convergence of the distributional Bellman operator (Bellemare et al., 2017a).
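For intuition on the Cramér-distance objective referenced above: for one-dimensional distributions, the Cramér distance is the integral of the squared difference between the two CDFs, and it can be computed from samples via the energy-distance identity. The sketch below is a standalone illustration only (the function name and sampling setup are ours, not part of GMAC's implementation):

```python
import numpy as np

def cramer_distance(x, y):
    """Cramer distance between the empirical distributions of 1-D arrays x, y.

    Uses the energy-distance identity
        d(F, G) = E|X - Y| - 0.5 * E|X - X'| - 0.5 * E|Y - Y'|,
    which equals the integral of (F(t) - G(t))^2 dt for the empirical CDFs.
    """
    xy = np.abs(x[:, None] - y[None, :]).mean()  # cross term  E|X - Y|
    xx = np.abs(x[:, None] - x[None, :]).mean()  # within-x term E|X - X'|
    yy = np.abs(y[:, None] - y[None, :]).mean()  # within-y term E|Y - Y'|
    return xy - 0.5 * xx - 0.5 * yy

# Point masses at 0 and 1: the CDFs differ by 1 on [0, 1], so the distance is 1.
print(cramer_distance(np.array([0.0]), np.array([1.0])))  # -> 1.0
```

Unlike the Wasserstein distance, whose sample gradients are biased, the Cramér distance admits unbiased sample gradients, which is what makes it attractive as a neural-network loss in this setting.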
In addition, the general form of the stochastic policy gradient method does not assume a specific type of action space, e.g., discrete or continuous (Williams, 1988; 1992; Sutton et al., 1999). Because the Wasserstein distance has biased sample gradients (Bellemare et al., 2017b), directly minimizing it is not preferred in practice as a loss function for a neural network; hence some exemplary works in deep DRL (Dabney et al., 2018b;a) minimize the Huber quantile regression loss (Huber, 1964) instead. We therefore treat the methods that minimize the Huber quantile loss as our baseline. However, as proven in Rowland

