A DISTRIBUTIONAL PERSPECTIVE ON ACTOR-CRITIC FRAMEWORK

Abstract

Recent distributional reinforcement learning methods, despite their successes, still contain fundamental problems that can lead to inaccurate representations of value distributions, such as distributional instability, action-type restriction, and conflation between samples and statistics. In this paper, we present a novel distributional actor-critic framework, GMAC, to address these problems. Adopting a stochastic policy removes the first two problems, and the conflation in the approximation is alleviated by minimizing the Cramér distance between the value distribution and its Bellman target distribution. In addition, GMAC improves data efficiency by generating the Bellman target distribution through the Sample-Replacement algorithm, denoted by SR(λ), which provides a distributional generalization of multi-step policy evaluation algorithms. We empirically show that our method captures the multimodality of value distributions and improves the performance of a conventional actor-critic method at low computational cost in both discrete and continuous action spaces, using the Arcade Learning Environment (ALE) and PyBullet environments.

1. INTRODUCTION

The ability to learn complex representations via neural networks has enjoyed success in various applications of reinforcement learning (RL), such as pixel-based video gameplay (Mnih et al., 2015), the game of Go (Silver et al., 2016), robotics (Levine et al., 2016), and high-dimensional control of humanoid robots (Lillicrap et al., 2016; Schulman et al., 2015). Starting from the seminal work on the Deep Q-Network (DQN) (Mnih et al., 2015), advances in value prediction networks, in particular, have been one of the main driving forces behind these breakthroughs. Among the milestones in value function approximation, distributional reinforcement learning (DRL) extends the scalar value function to a distributional representation. The distributional perspective offers various benefits by providing more information on the characteristics and behavior of the value. One such benefit is the preservation of multimodality in value distributions, which leads to more stable learning of the value function (Bellemare et al., 2017a).

Despite these developments, several issues remain that hinder DRL from becoming a robust framework. First, a theoretical instability exists in the control setting of value-based DRL methods (Bellemare et al., 2017a). Second, previous DRL algorithms are limited to a single type of action space, either discrete (Bellemare et al., 2017a; Dabney et al., 2018b;a) or continuous (Barth-Maron et al., 2018; Singh et al., 2020). Third, a common choice of loss function is the Huber quantile regression loss, which is vulnerable to conflation between samples and statistics without an imputation strategy (Rowland et al., 2019). The instability issue disappears once a trainable policy is introduced, i.e., when the Bellman operator is used in the evaluation setting, as shown by the convergence of the distributional Bellman operator (Bellemare et al., 2017a).
In addition, the general form of the stochastic policy gradient method does not assume a specific type of action space, e.g., discrete or continuous (Williams, 1988; 1992; Sutton et al., 1999). Because the Wasserstein distance has biased sample gradients (Bellemare et al., 2017b), directly minimizing it as a loss function of a neural network is often not preferred in practice, and some of the exemplary works of deep DRL (Dabney et al., 2018b;a) minimize the Huber quantile regression loss (Huber, 1964) instead. Accordingly, we treat the methods that minimize the Huber quantile loss as our baseline. However, as proven in Rowland et al. (2019), representing a distribution using Huber quantiles instead of samples can lead to conflation between statistics and samples, so an imputation strategy is required to learn a more accurate representation of a distribution. Despite its theoretical soundness, the imputation strategy can introduce computational overhead depending on the statistics (e.g., quantiles, expectiles). To avoid this issue, we formulate the DRL problem through samples and parameters instead of statistics, and minimize the Cramér distance between the value distributions. Combining these solutions, we arrive at a distributional actor-critic with the Cramér distance as the value distribution loss. On the other hand, many actor-critic methods suffer from data inefficiency and often include multi-step algorithms such as the λ-return algorithm (Watkins, 1989) and off-policy updates, e.g., importance sampling. Here we address this problem by adapting multi-step off-policy updates to the distributional perspective, defining a generalized form of multi-step distributional Bellman targets. Furthermore, we introduce a novel value-distribution learning method which we call Sample-Replacement, denoted by SR(λ). We show that the expectation of the target distribution from SR(λ) is equivalent to the scalar λ-return.
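To make the loss concrete: the squared Cramér (ℓ2) distance between two one-dimensional distributions is the integral of the squared difference of their CDFs. Below is a minimal NumPy sketch for two empirical sample sets; the function name and implementation details are ours for illustration, not code from the paper.

```python
import numpy as np

def cramer_distance(xs, ys):
    """Squared Cramer (l2) distance between two 1-D empirical
    distributions: integrate (F - G)^2 over the pooled support."""
    grid = np.sort(np.concatenate([xs, ys]))
    # empirical CDFs evaluated just after each pooled support point
    F = np.searchsorted(np.sort(xs), grid, side="right") / len(xs)
    G = np.searchsorted(np.sort(ys), grid, side="right") / len(ys)
    widths = np.diff(grid)  # spacing between consecutive support points
    return np.sum((F[:-1] - G[:-1]) ** 2 * widths)
```

For example, two Dirac samples at 0 and 1 give a squared distance of 1, since the CDFs differ by 1 on the unit interval; identical sample sets give 0.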
Additionally, we propose to parameterize the value distribution as a Gaussian mixture model (GMM). Combining the GMM with the Cramér distance yields an analytic solution with unbiased sample gradients at a much lower computational cost than the Huber quantile loss. Altogether, we call our framework GMAC (Gaussian mixture actor-critic). We present experimental results demonstrating that this framework outperforms the baseline algorithm with a scalar value function in discrete action spaces, and extends to continuous action spaces without any architectural or algorithmic modification. Furthermore, we show that a more accurate representation of value distributions is learned at a lower computational cost.
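To illustrate why the GMM parameterization is convenient here: in one dimension the squared Cramér distance obeys the energy-distance identity ℓ2²(P, Q) = E|X−Y| − ½E|X−X′| − ½E|Y−Y′|, and for Gaussian mixtures each expectation reduces to a weighted sum of Gaussian mean-absolute-values in closed form. The sketch below uses our own notation and is not the paper's exact derivation; it assumes strictly positive component scales.

```python
import numpy as np
from scipy.special import erf

def _mean_abs_gauss(d, s2):
    # E|Z| for Z ~ N(d, s2), in closed form
    s = np.sqrt(s2)
    return (d * erf(d / (s * np.sqrt(2.0)))
            + s * np.sqrt(2.0 / np.pi) * np.exp(-(d ** 2) / (2.0 * s2)))

def cramer_gmm(w, mu, sig, v, nu, tau):
    """Squared Cramer (l2) distance between 1-D Gaussian mixtures
    (weights w, means mu, scales sig) and (v, nu, tau)."""
    def cross(wa, ma, sa, wb, mb, sb):
        d = ma[:, None] - mb[None, :]             # pairwise mean differences
        s2 = sa[:, None] ** 2 + sb[None, :] ** 2  # variances add for X - Y
        return np.sum(wa[:, None] * wb[None, :] * _mean_abs_gauss(d, s2))
    return (cross(w, mu, sig, v, nu, tau)
            - 0.5 * cross(w, mu, sig, w, mu, sig)
            - 0.5 * cross(v, nu, tau, v, nu, tau))
```

Since every term is a smooth function of the mixture parameters, gradients flow through this expression directly, without sampling or quantile imputation.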



Figure 1: Modality of value distribution during the learning process of Breakout-v4. (a) An arrow is added in the inset to indicate the ball's direction of travel. The episode reaches a terminal state if the paddle misses the ball. (b) Probability density functions of the value distributions learned by each actor-critic when {0.2, 0.4, 1.2}M frames are collected. As the policy improves, the probability of losing a turn (V = 0) decreases, and the probability of earning scores (V > 0) increases. Note that the modality transition from V = 0 is clearly captured by the GMM + Cramér method.

2. RELATED WORKS

Bellemare et al. (2017a) have shown that the distributional Bellman operator derived from the distributional Bellman equation is a contraction in a maximal form of the Wasserstein distance. Based on this result, Bellemare et al. (2017a) proposed a categorical distributional model, C51, which was later shown to minimize the Cramér distance in the projected distributional space (Rowland et al., 2018; Bellemare et al., 2019). Dabney et al. (2018b) proposed a quantile regression-based model, QR-DQN, which parameterizes the distribution with a uniform mixture of Diracs and uses a sample-based Huber quantile loss (Huber, 1964). Dabney et al. (2018a) later expanded it so that a full continuous quantile function can be learned through the implicit quantile network (IQN). Yang et al. (2019) then further improved the approximation of the distribution by adjusting the set of quantiles. Rowland et al. (2019) proposed expectile regression in place of quantile regression for learning a categorical distribution, to address the error in the Bellman target approximation. Choi et al. (2019) suggested parameterizing the value distribution using a Gaussian mixture and minimizing the Tsallis-Jensen divergence as the loss function in a value-based method. Outside of RL, Bellemare

