A DISTRIBUTIONAL PERSPECTIVE ON ACTOR-CRITIC FRAMEWORK

Abstract

Recent distributional reinforcement learning methods, despite their successes, still contain fundamental problems that can lead to inaccurate representations of value distributions, such as distributional instability, action type restriction, and conflation between samples and statistics. In this paper, we present a novel distributional actor-critic framework, GMAC, to address such problems. Adopting a stochastic policy removes the first two problems, and the conflation in the approximation is alleviated by minimizing the Cramér distance between the value distribution and its Bellman target distribution. In addition, GMAC improves data efficiency by generating the Bellman target distribution through the Sample-Replacement algorithm, denoted SR(λ), which provides a distributional generalization of multi-step policy evaluation algorithms. We empirically show that our method captures the multimodality of value distributions and improves the performance of a conventional actor-critic method at low computational cost in both discrete and continuous action spaces, using the Arcade Learning Environment (ALE) and PyBullet environments.

1. INTRODUCTION

The ability to learn complex representations via neural networks has enjoyed success in various applications of reinforcement learning (RL), such as pixel-based video gameplay (Mnih et al., 2015), the game of Go (Silver et al., 2016), robotics (Levine et al., 2016), and high-dimensional control of humanoid robots (Lillicrap et al., 2016; Schulman et al., 2015). Starting from the seminal work on the Deep Q-Network (DQN) (Mnih et al., 2015), advances in value prediction networks, in particular, have been one of the main driving forces of this breakthrough. Among the milestones in value function approximation, distributional reinforcement learning (DRL) develops the scalar value function into a distributional representation. The distributional perspective offers various benefits by providing more information on the characteristics and the behavior of the value. One such benefit is the preservation of multimodality in value distributions, which leads to more stable learning of the value function (Bellemare et al., 2017a). Despite these developments, several issues remain that hinder DRL from becoming a robust framework. First, a theoretical instability exists in the control setting of value-based DRL methods (Bellemare et al., 2017a). Second, previous DRL algorithms are limited to a single type of action space, either discrete (Bellemare et al., 2017a; Dabney et al., 2018b;a) or continuous (Barth-Maron et al., 2018; Singh et al., 2020). Third, a common choice of loss function is the Huber quantile regression loss, which is vulnerable to conflation between samples and statistics without an imputation strategy (Rowland et al., 2019). The instability issue is not present if a trainable policy is introduced, i.e., the evaluation setting of the Bellman operator is used, as shown by the convergence of the distributional Bellman operator (Bellemare et al., 2017a).
In addition, the general form of the stochastic policy gradient method does not assume a specific type of action space, e.g., discrete or continuous (Williams, 1988; 1992; Sutton et al., 1999). Because the Wasserstein distance has biased sample gradients (Bellemare et al., 2017b), directly minimizing it is often not preferred as a loss function for a neural network in practice, and thus some of the exemplary works in deep DRL (Dabney et al., 2018b;a) minimize the Huber quantile regression loss (Huber, 1964) instead. To this end, we treat the methods that minimize the Huber quantile loss as our baselines. However, as proven in Rowland et al. (2019), representing a distribution using Huber quantiles instead of samples can lead to conflation between statistics and samples, and thus an imputation strategy is required to learn a more accurate representation of a distribution. Despite its theoretical soundness, the imputation strategy can introduce computational overhead depending on the statistics (e.g., quantiles, expectiles). To avoid this issue, we formulate the DRL problem through samples and parameters instead of statistics, and minimize the Cramér distance between the value distributions. Combining these solutions, we arrive at a distributional actor-critic with the Cramér distance as the value distribution loss. On the other hand, many actor-critic methods suffer from data inefficiency and often include multi-step algorithms like the λ-return algorithm (Watkins, 1989) and off-policy updates, e.g., importance sampling. Here we address this problem by adapting multi-step off-policy updates to the distributional perspective, defining a generalized form of multi-step distributional Bellman targets. Furthermore, we introduce a novel value-distribution learning method which we call Sample-Replacement, denoted SR(λ). We show that the expectation of the target distribution from SR(λ) is equivalent to the scalar λ-return.
Additionally, we propose to parameterize the value distribution as a Gaussian mixture model (GMM). Combining the GMM with the Cramér distance, we can derive an analytic solution and obtain unbiased sample gradients at a much lower computational cost than the method using the Huber quantile loss. Altogether, we call our framework GMAC (Gaussian mixture actor-critic). We present experimental results to demonstrate that this framework outperforms the baseline algorithm with a scalar value function in discrete action spaces, and can be extended to continuous action spaces without any architectural or algorithmic modification. Furthermore, we show that a more accurate representation of value distributions is learned at a lower computational cost.

2. RELATED WORKS

Bellemare et al. (2017a) showed that the distributional Bellman operator derived from the distributional Bellman equation is a contraction in a maximal form of the Wasserstein distance. Based on this point, Bellemare et al. (2017a) proposed a categorical distributional model, C51, which was later shown to minimize the Cramér distance in a projected distributional space (Rowland et al., 2018; Bellemare et al., 2019). Dabney et al. (2018b) proposed a quantile regression-based model, QR-DQN, which parameterizes the distribution with a uniform mixture of Diracs and uses a sample-based Huber quantile loss (Huber, 1964). Dabney et al. (2018a) later expanded it so that a full continuous quantile function can be learned through the implicit quantile network (IQN). Yang et al. (2019) then further improved the approximation of the distribution by adjusting the set of quantiles. Rowland et al. (2019) proposed expectile regression in place of quantile regression for learning a categorical distribution to address the error in the Bellman target approximation. Choi et al. (2019) suggested parameterizing the value distribution as a Gaussian mixture and minimizing the Tsallis-Jenson divergence as the loss function in a value-based method. Outside of RL, Bellemare et al. (2017b) proposed using the Cramér distance in place of the Wasserstein distance used in WGAN (Arjovsky et al., 2017) due to its unbiased sample gradients.

There have been many applications of the distributional perspective that exploit the additional information from the value distribution. Dearden et al. (1998) modeled parametric uncertainty, and Morimura et al. (2010a;b) designed a risk-sensitive algorithm using a distributional perspective, which can be seen as the earliest concepts of distributional RL. Mavrin et al. (2019) utilized the uncertainty captured by the variance of the value distribution. Nikolov et al. (2019) also utilized the distributional representation of the value function by using information-directed sampling for better exploration in a value-based method. While a multi-step Bellman target was considered in Hessel et al. (2018), sample efficiency was directly addressed by combining multi-step off-policy algorithms like Retrace(λ) (Gruslys et al., 2017). Just as C51 expanded deep RL to the distributional perspective, Barth-Maron et al. (2018) studied a distributional perspective on DDPG (Lillicrap et al., 2016), an actor-critic method, by parameterizing a distributional critic as a categorical distribution and a Gaussian mixture model. Singh et al. (2020) further expanded this work by using an implicit quantile network for the critic. Several works (Duan et al., 2020; Kuznetsov et al., 2020; Ma et al., 2020) have proposed distributional versions of the soft actor-critic (SAC) framework to address the error from over-estimating the value. However, these previous works concentrated on extending a specific actor-critic framework to the distributional setting. We therefore aim to suggest methods that can be easily adopted when expanding a scalar value method to a distributional perspective, along with an attempt to address the previously mentioned issues present in value-based distributional algorithms.

3. DISTRIBUTIONAL REINFORCEMENT LEARNING

We consider a conventional RL setting, where an agent's interaction with its environment is described by a Markov Decision Process (MDP) $(\mathcal{X}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces, $R(x, a)$ is the stochastic reward function for a pair of state $x$ and action $a$, $P(x'|x, a)$ is the probability of transitioning to $x'$ given the pair $(x, a)$, and $\gamma \in (0, 1)$ is a time discount factor. A policy $\pi(\cdot|x)$ maps a state $x \in \mathcal{X}$ to a probability distribution over actions $a \in \mathcal{A}$. The objective of RL is to maximize the expected return $\mathbb{E}[G_t]$, where $G_t = \sum_{k=0}^{\infty} \gamma^k R(x_{t+k}, a_{t+k})$ is the sum of discounted rewards from state $x_t$ given a policy $\pi$ at time $t$. Then for any state $x_t$, the value $V$ and state-action value $Q$ under the given policy $\pi$ can be defined as

$$V(x_t) = \mathbb{E}[G_t \mid X = x_t], \qquad Q(x_t, a_t) = \mathbb{E}[G_t \mid X = x_t, A = a_t].$$

A recursive relationship of the value in terms of the reward and the random transition is described by the Bellman equation (Bellman, 1957),

$$Q(x, a) = \mathbb{E}[R(x, a)] + \gamma\, \mathbb{E}_{a' \sim \pi,\, x' \sim P}[Q(x', a')],$$

where the first expectation is calculated over a given state-action pair $(x, a)$ and the second expectation is taken over the next possible states $x' \sim P(\cdot|x, a)$ and actions $a' \sim \pi(\cdot|x')$. DRL extends the Bellman equation to an analogous recursive equation, termed the distributional Bellman equation (Morimura et al., 2010a;b; Bellemare et al., 2017a), using a distribution $Z(x, a)$ of the possible sums of discounted rewards:

$$Z(x, a) \overset{D}{=} R(x, a) + \gamma Z(x', a'),$$

where $\overset{D}{=}$ denotes equality in distribution and $Q(x, a) = \mathbb{E}[Z(x, a)]$. Then $Z$ is learned through the distributional Bellman operator $\mathcal{T}^\pi$, defined as

$$\mathcal{T}^\pi Z(x, a) \overset{D}{:=} R(x, a) + \gamma P^\pi Z(x, a),$$

where $P^\pi : \mathcal{Z} \to \mathcal{Z}$ is a state transition operator under policy $\pi$, $P^\pi Z(x, a) \overset{D}{:=} Z(x', a')$ with $x' \sim P(\cdot|x, a)$ and $a' \sim \pi(\cdot|x')$. Analogously, the distributional Bellman optimality operator $\mathcal{T}$ can be defined as

$$\mathcal{T} Z(x, a) \overset{D}{:=} R(x, a) + \gamma Z\Big(x', \operatorname*{arg\,max}_{a'} \mathbb{E}_{x' \sim P}[Z(x', a')]\Big). \tag{5}$$

Algorithm 1: SR(λ)
  Input: trajectory of states and values $\{(x_1, Z_1), \ldots, (x_T, Z_T)\}$ for a given length $T$, discount factor $\gamma$, weight parameter $\lambda$
  Output: set of λ-returns $\{Z^{(\lambda)}_1, \ldots, Z^{(\lambda)}_{T-1}\}$
  $X \leftarrow$ collect $m$ samples $\{X_1, \ldots, X_m\}$ from $Z_T$
  for $t = T-1$ to $1$ do
    $X \leftarrow r_t + \gamma X$  // Bellman operator
    $Z^{(\lambda)}_t \leftarrow \frac{1}{m}\sum_{i=1}^{m} \delta_{X_i}$  // empirical distribution using $m$ Diracs
    $\tilde{X} \leftarrow$ collect $m$ samples $\{\tilde{X}_1, \ldots, \tilde{X}_m\}$ from $Z_t$
    for $i = 1$ to $m$ do
      $X_i \leftarrow \tilde{X}_i$ with probability $1 - \lambda$
    end
  end

The distributional Bellman operator has been proven to be a γ-contraction in a maximal form of the Wasserstein distance (Bellemare et al., 2017a), which has a practical definition given by

$$d_p(U, V) = \left(\int_0^1 \left|F_U^{-1}(\omega) - F_V^{-1}(\omega)\right|^p d\omega\right)^{1/p},$$

where $U, V$ are random variables and $F_U, F_V$ are their cumulative distribution functions (cdfs). However, unlike the distributional Bellman operator, the distributional Bellman optimality operator is not a contraction in any metric (Bellemare et al., 2017a). Thus, the distance $d_p(\mathcal{T} Z_1, \mathcal{T} Z_2)$ between some random variables $Z_1, Z_2$ may not converge to a unique solution. This issue has been discussed in Bellemare et al. (2017a), with an example of an oscillating value distribution caused by a specific tie-breaker design of the argmax operator. One way to remove this issue from consideration is to learn the value distributions via the expected Bellman operator with a trainable stochastic policy, and to find an optimal policy under the principles of conservative policy iteration by Kakade & Langford (2002). See Appendix A for more discussion.
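The backward traversal in Algorithm 1 can be sketched in a few lines of NumPy. This is our own minimal illustration, under the assumption that each value distribution $Z_t$ is already represented by $m$ samples (in the full method these would be drawn from the critic); the function name and array layout are ours.

```python
import numpy as np

def sr_lambda(rewards, value_samples, gamma=0.99, lam=0.95, rng=None):
    """Sample-Replacement SR(lambda), a sketch of Algorithm 1.

    rewards:       r_1..r_{T-1}, shape (T-1,)
    value_samples: m samples from each value distribution Z(x_t), shape (T, m)
    Returns an array of shape (T-1, m) whose row t holds m samples of the
    empirical lambda-return distribution for state x_{t+1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, m = value_samples.shape
    targets = np.empty((T - 1, m))
    X = value_samples[-1].copy()          # start from samples of Z_T
    for t in range(T - 2, -1, -1):        # traverse the trajectory backwards
        X = rewards[t] + gamma * X        # distributional Bellman operator
        targets[t] = X                    # store the lambda-return samples
        # with probability 1 - lambda, replace each sample with a bootstrap
        # sample from the current value estimate Z(x_t)
        replace = rng.random(m) < (1.0 - lam)
        X = np.where(replace, value_samples[t], X)
    return targets
```

Setting `lam=1.0` disables replacement and recovers Monte Carlo returns bootstrapped only at the final state, while `lam=0.0` always replaces and yields one-step targets, mirroring the two extremes of the λ-return.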

4. ALGORITHM

In this section, we incrementally develop the building blocks of our proposed method. First, we present an efficient distributional version of the λ-return algorithm called Sample-Replacement, denoted SR(λ). Then, we show that minimizing the energy distance, which is equivalent to a specific form of the Cramér distance, between Gaussian mixtures can be a better solution than using quantile regression when working with a distributional actor-critic and SR(λ).

4.1. SR(λ): SAMPLE-REPLACEMENT FOR λ-RETURN DISTRIBUTION

The actor-critic method is a temporal-difference (TD) learning method in which the value function, the critic, is learned through the TD error, defined as the difference between the current value estimate $V(x_t)$ and the TD target given by the $n$-step return

$$G^{(n)}_t = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^n V(x_{t+n}).$$

A special case of the TD method, called TD(λ) (Sutton, 1988), generates a weighted average of $n$-step returns for the TD target, also known as the λ-return,

$$G^{(\lambda)}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t, \quad \lambda \in [0, 1], \tag{7}$$

to mitigate the variance-bias trade-off between the Monte Carlo and TD(0) returns and to enhance data efficiency. From an alternative perspective, equation 7 can be thought of as finding a TD target by taking the expectation of a random variable $\tilde{G}$ whose sample space is the set of all $n$-step returns $\{G^{(1)}_t, \ldots, G^{(\infty)}_t\}$. The probability distribution of $\tilde{G}$ is then given by

$$\Pr[\tilde{G} = G^{(n)}_t] = (1 - \lambda)\lambda^{n-1}. \tag{8}$$

Similar to $G^{(n)}_t$, we define the $n$-step approximation of the value distribution as

$$Z^{(n)}_t \overset{D}{:=} \sum_{i=0}^{n-1} \gamma^i R(x_{t+i}, a_{t+i}) + \gamma^n \mathbb{E}_{a' \sim \pi}[Z(x_{t+n}, a')],$$

where $\mathbb{E}[Z^{(n)}_t] = G^{(n)}_t$. Then a distributional analogy of equation 8 can be written as

$$\Pr[\tilde{Z} = Z^{(n)}_t] = (1 - \lambda)\lambda^{n-1}, \tag{10}$$

where $\tilde{Z}$ is a random variable whose sample space is the set of all $n$-step approximations $\{Z^{(1)}_t, \ldots, Z^{(\infty)}_t\}$. Unlike $\tilde{G}$, we cannot directly calculate an expectation over a set of random variables $\tilde{Z}$. Instead, we redefine equation 10 in terms of cdfs:

$$\Pr[\tilde{F} = F_{Z^{(n)}_t}] = (1 - \lambda)\lambda^{n-1}, \tag{11}$$

where $F_{Z^{(n)}_t}$ denotes the cdf of the $n$-step return $Z^{(n)}_t$, and $\tilde{F}$ is a random variable over the set $\{F_{Z^{(1)}_t}, \ldots, F_{Z^{(\infty)}_t}\}$. Then, for any $z$, we can rewrite equation 11 as a linear combination of the $F_{Z^{(n)}_t}$:

$$\mathbb{E}[\tilde{F}] = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} F_{Z^{(n)}_t}. \tag{12}$$

Let us define a random variable $Z^{(\lambda)}_t$ that has $\mathbb{E}[\tilde{F}]$ as its cdf. Then the expectation of $Z^{(\lambda)}_t$ and the expectations of the $Z^{(n)}_t$ have a relationship analogous to equation 7 (see Appendix B), meaning that its behavior in expectation is equal to that of the λ-return. Therefore, we treat the resulting random variable $Z^{(\lambda)}_t$ as a distributional analogy of the λ-return. We note that collecting an infinite-horizon trajectory is infeasible in practice, and thus we use a truncated sum (Cichosz, 1995; van Seijen et al., 2011):

$$F_{Z^{(\lambda)}_t} = (1 - \lambda) \sum_{n=1}^{N} \lambda^{n-1} F_{Z^{(n)}_t} + \lambda^N F_{Z^{(N)}_t}. \tag{13}$$

Given a trajectory of length $N$, naïvely, finding $Z^{(\lambda)}_t$ for each time step requires finding $N$ different $Z^{(n)}_t$, so we need a total of $O(N^2)$ different distributions to find $Z^{(\lambda)}_t$ for all states in the given trajectory. In practice, the number of distributions to find can be reduced to $O(N)$ by approximating the distribution of $Z^{(n)}_t$ with a mixture of Diracs, as described in Dabney et al. (2018b), because the Bellman operations can be applied to the same set of samples to find $Z^{(n)}_t$ for different $t$'s. The approximation then takes the form

$$Z^{(n)}_t \approx Z_\theta(x_t) := \frac{1}{m} \sum_{i=1}^{m} \delta_{\theta_i(x_t)}, \tag{14}$$

where $\theta : \mathcal{X} \to \mathbb{R}^m$ is some parametric model. To obtain the target samples for $Z_\theta(x_t)$, we start with $m$ samples of the last value distribution in the sampled trajectory. Traversing the sampled trajectory in reverse order, we replace each sample with a new sample from the next state's value distribution with probability $1 - \lambda$. The obtained set $X_m$ is then equivalent to a set of samples from the approximated distribution of the λ-returns, $Z^{(\lambda)}_t$. A more detailed description of the algorithm can be found in Algorithm 1. For sample-based methods like the implicit quantile network, one can directly use this set $X_m$ as the target samples. However, Rowland et al. (2019) have shown that the quantiles predicted from the Huber quantile loss cannot be interpreted as samples, so an imputation strategy is required to generate a distribution from the statistics. We propose instead to minimize the Cramér distance, in which the predicted parameters can be samples or the parameters of the distribution itself.
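The truncated weighting in the equations above can be made concrete with a small helper. This is our own illustration (names are ours): it builds the mixture weights over $n$-step targets implied by the truncated sum and checks that they form a proper distribution whose expectation matches the scalar λ-return.

```python
import numpy as np

def truncated_lambda_weights(N, lam):
    """Weights over n-step targets (n = 1..N) implied by the truncated
    lambda-return: (1 - lam) * lam**(n-1), with the leftover tail mass
    lam**N assigned to the N-step target so the weights sum to one."""
    w = (1.0 - lam) * lam ** np.arange(N)   # weights for n = 1, ..., N
    w[-1] += lam ** N                        # truncation mass on n = N
    return w

def lambda_return_from_nstep(G_n, lam):
    """Expectation of the truncated lambda-return given n-step returns G_n."""
    return float(np.dot(truncated_lambda_weights(len(G_n), lam), G_n))
```

Because $(1-\lambda)\sum_{n=1}^{N}\lambda^{n-1} + \lambda^N = 1$, the weights always sum to one; mixing the cdfs $F_{Z^{(n)}_t}$ with these weights is exactly the construction used for $Z^{(\lambda)}_t$.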

4.2. CRAM ÉR DISTANCE

Let $P$ and $Q$ be probability distributions over $\mathbb{R}$. If we define the cdfs of $P, Q$ as $F_P, F_Q$ respectively, the $\ell_p$ family of divergences between $P$ and $Q$ is

$$\ell_p(P, Q) := \left(\int_{-\infty}^{\infty} |F_P(x) - F_Q(x)|^p \, dx\right)^{1/p}. \tag{15}$$

When $p = 2$, it is termed the Cramér distance. The distributional Bellman operator in the evaluation setting is a $|\gamma|^{1/p}$-contraction mapping in the Cramér metric space (Rowland et al., 2019); a worked-out proof can be found in Appendix C. A notable characteristic of the Cramér distance is the unbiasedness of its sample gradient,

$$\mathbb{E}_{X_m \sim P}\left[\nabla_\theta\, \ell_2^2(\hat{P}_m, Q_\theta)\right] = \nabla_\theta\, \ell_2^2(P, Q_\theta), \tag{16}$$

where $X_m := \{X_1, \ldots, X_m\}$ are samples drawn from $P$, $\hat{P}_m := \frac{1}{m} \sum_{i=1}^{m} \delta_{X_i}$ is the empirical distribution, and $Q_\theta$ is a parametric approximation of a distribution. The unbiased sample gradient makes the Cramér distance suitable for use with stochastic gradient descent and empirical distributions when updating the value distribution. Székely (2002) showed that, in the univariate case, the squared Cramér distance is equivalent to one half of the energy distance, $\ell_2^2(P, Q) = \frac{1}{2}\mathcal{E}(P, Q)$, defined as

$$\mathcal{E}(P, Q) := \mathcal{E}(U, V) = 2\,\mathbb{E}\|U - V\|_2 - \mathbb{E}\|U - U'\|_2 - \mathbb{E}\|V - V'\|_2, \tag{17}$$

where $U, U'$ and $V, V'$ are independent random variables that follow $P$ and $Q$, respectively.
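To make the Székely (2002) identity concrete, here is a small self-contained check (our own illustration, not code from the paper): one function computes the sample-based energy distance, the other numerically integrates the squared cdf difference.

```python
import numpy as np

def energy_distance(u, v):
    """Sample-based energy distance E(P,Q) = 2E|U-V| - E|U-U'| - E|V-V'|
    between two univariate empirical distributions."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    d = lambda a, b: np.abs(a[:, None] - b[None, :]).mean()
    return 2.0 * d(u, v) - d(u, u) - d(v, v)

def cramer_sq(u, v, grid):
    """Squared Cramer distance: Riemann sum of (F_P - F_Q)^2 over a dense grid."""
    Fu = (np.asarray(u)[None, :] <= grid[:, None]).mean(axis=1)  # empirical cdf
    Fv = (np.asarray(v)[None, :] <= grid[:, None]).mean(axis=1)
    sq = (Fu - Fv) ** 2
    return float(np.sum(sq[:-1] * np.diff(grid)))
```

For the empirical distributions {0, 1} and {0.5, 2}, the energy distance evaluates to 0.75 and the integrated squared cdf difference to 0.375, matching $\ell_2^2 = \frac{1}{2}\mathcal{E}$.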

4.3. ENERGY DISTANCE BETWEEN GAUSSIAN MIXTURE MODELS

So far, we have described the components required to formulate a general setting of the suggested distributional extension to actor-critic methods. Here, we take one step further to enhance the approximation accuracy and computational efficiency by parameterizing the value distribution as a Gaussian mixture model (Choi et al., 2019; Barth-Maron et al., 2018). Following the same assumption used for equation 14, the approximation using a Gaussian mixture is given by parametric models $\mu, \sigma, w : \mathcal{X} \to \mathbb{R}^K$,

$$Z_\theta(x_t) := \sum_{i=1}^{K} w_i(x_t)\, \mathcal{N}\big(z; \mu_i(x_t), \sigma_i(x_t)^2\big), \quad \text{where } \sum_{i=1}^{K} w_i(x_t) = 1. \tag{18}$$

If random variables $X, Y$ follow the distributions $P, Q$ parameterized as GMMs, the energy distance has the closed form

$$\mathcal{E}(X, Y) = 2\delta(X, Y) - \delta(X, X') - \delta(Y, Y'), \quad \text{where } \delta(U, V) = \sum_{i,j} w_{u_i} w_{v_j}\, \mathbb{E}\big[|\mathcal{N}(z; \mu_{u_i} - \mu_{v_j}, \sigma_{u_i}^2 + \sigma_{v_j}^2)|\big]. \tag{19}$$

Here, $\mu_{x_i}$ refers to the $i$-th component for random variable $X$, and the same applies to $\sigma$ and $w$. With Gaussian mixtures, the closed-form solution of the energy distance defined in equation 19 has a computational advantage over sample-based approximations like the Huber quantile loss. When using a GMM, the analytic approximation of equation 14 can be derived as

$$Z^{(\lambda)}_t \approx \sum_{n=1}^{\infty} (1 - \lambda)\lambda^{n-1} \sum_{k=1}^{K} w_{nk}\, \mathcal{N}(z; \mu_{nk}, \sigma_{nk}^2) \approx \frac{1}{m} \sum_{i=1}^{m} \mathcal{N}(z; \mu_{n_i k_i}, \sigma_{n_i k_i}^2), \tag{20}$$

where $\mu_{nk}$ refers to the $k$-th component of $\mu_{Z^{(n)}_t}$ for simplicity of notation, each $n_i$ is sampled from $\mathrm{Geo}(1 - \lambda)$, and each index $k_i \in \{1, \ldots, K\}$ is sampled proportionally to the mixture weights $\{w_1, \ldots, w_K\}$. This is equivalent to having a mixture of $m$ Gaussians, so we can simply perform sample replacement on the parameters $(\mu, \sigma^2)$ instead of on realizations of the random variables as in equation 14. Then the loss function described in equation 19 can easily be applied. Bringing all the components together, we have a distributional actor-critic framework with SR(λ) that minimizes the Cramér distance between Gaussian mixture value distributions.
We call this method GMAC. A brief sketch of the algorithm is shown in Appendix E. 
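The closed form of equation 19 can be evaluated directly, since each $\mathbb{E}[|\mathcal{N}(\cdot)|]$ term is the mean of a folded normal distribution (spelled out in Appendix D.3). The following is our own self-contained sketch; the function names are ours.

```python
import math

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def folded_normal_mean(mu, var):
    """E|D| for D ~ N(mu, var): the closed form of each E|N(.)| term."""
    sd = math.sqrt(var)
    return (sd * math.sqrt(2.0 / math.pi) * math.exp(-mu * mu / (2.0 * var))
            + mu * (1.0 - 2.0 * normal_cdf(-mu / sd)))

def delta(w_u, mu_u, var_u, w_v, mu_v, var_v):
    """delta(U, V) = sum_ij w_ui w_vj E|N(mu_ui - mu_vj, var_ui + var_vj)|."""
    return sum(wi * wj * folded_normal_mean(mi - mj, vi + vj)
               for wi, mi, vi in zip(w_u, mu_u, var_u)
               for wj, mj, vj in zip(w_v, mu_v, var_v))

def gmm_energy(w_u, mu_u, var_u, w_v, mu_v, var_v):
    """Closed-form energy distance between two univariate Gaussian mixtures."""
    return (2.0 * delta(w_u, mu_u, var_u, w_v, mu_v, var_v)
            - delta(w_u, mu_u, var_u, w_u, mu_u, var_u)
            - delta(w_v, mu_v, var_v, w_v, mu_v, var_v))
```

Two quick sanity checks: identical mixtures give an energy distance of zero, and two near-Dirac components at 0 and 1 give a value close to 2, which matches the sample-based energy distance between the corresponding point masses.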

5. EXPERIMENTS

In this section, we present experimental results for three different distributional versions of Proximal Policy Optimization (PPO) with SR(λ): IQAC (IQN + Huber quantile), IQAC-E (IQN + energy distance), and GMAC (GMM + energy distance), in the order of the progression of our suggested approach. The performance of the scalar version of PPO with value clipping (Schulman et al., 2017) is used as the baseline for comparison. Details about the loss function of each method can be found in Appendix D. For a fair comparison, we keep all common hyperparameters consistent across the algorithms except for the value heads and their respective hyperparameters (see Appendix F). The results demonstrate three contributions of our proposed DRL framework: 1) the ability to correctly capture the multimodality of value distributions, 2) generalization to both discrete and continuous action spaces, and 3) significantly reduced computational cost.

Representing Multimodality As discussed throughout Section 4, we expect minimizing the Cramér distance to produce a more appropriate depiction of a distribution than minimizing the Huber quantile loss. First, we demonstrate this with a simple value regression problem for an MDP of five sequential states, as shown in Figure 2(a). The reward functions r_i of the last two states S_i are stochastic, with r_4 drawn from a uniform discrete distribution and r_5 from a normal distribution. The value distribution of S_1 should therefore be bimodal with an expectation of zero (Figure 2(b)). In this example, minimizing the Huber quantile loss of an implicit quantile network (labeled IQ-naive) underestimates the variance of S_1 due to conflation and does not capture the locations of the modes. Applying an imputation strategy as suggested in Rowland et al. (2019) improves the underestimation of the variance. On the other hand, minimizing the Cramér distance converges to the correct mode locations using both the implicit quantile network and the Gaussian mixture model, labeled IQE and GM in the figure, respectively. More details about the experimental setup and further results can be found in Appendix G. The comparison can be extended to more complex tasks such as the Atari games, for which a sample result is shown in Figure 1; additional visualizations of the value distribution during the learning process on different games can be found in Appendix G.

So what can be achieved with correct modality? By capturing the correct modes of the distribution, an additional degree of freedom on top of the expected value can be accurately obtained, from which richer information can be derived to distinguish states by their value distributions. In particular, this extra information may be utilized as intrinsic motivation in sparse-reward exploration tasks. To demonstrate, we compare using the Cramér distance between value distributions as an intrinsic reward to using the TD error between scalar value estimates in the sparse-reward environment of Montezuma's Revenge in Figure 2(c), which shows a clear improvement in performance.

Discrete and Continuous Action Spaces

Experimental results from the ALE (Bellemare et al., 2013) and the PyBullet environments (Coumans & Bai, 2016-2020) show that our algorithm can be applied to both discrete and continuous action spaces. All experiments are run over 5 random seeds with consistent hyperparameters. In Figure 3, a solid line represents the average of mean scores over the 100 most recent episodes, and the shaded area represents the standard deviation over the seeds. For visual clarity, the plots are smoothed over 5M frames and 1.25M frames in Atari and PyBullet, respectively. Non-smoothed learning curves on more tasks and the final scores over the 61 ALE tasks can be found in Appendix G.

Computational Cost Table 1 shows the number of parameters and the number of floating-point operations (FLOPs) required for a single inference and update step of each agent. We emphasize three points here. First, the implicit quantile network requires more parameters due to the intermediate embeddings of random quantiles. Second, the difference between the FLOPs for a single update in IQAC and IQAC-E indicates that the proposed energy distance requires less computation than the Huber quantile regression. Last, the results for GMAC show that using Gaussian mixtures can greatly reduce the cost, even matching the numbers of PPO, while having improved performance.

6. CONCLUSION

In this paper, we have developed a distributional perspective on the actor-critic framework which integrates the SR(λ) method, the Cramér distance, and Gaussian mixture models for improved performance in both discrete and continuous action spaces at a lower computational cost. Furthermore, we show that our proposed method can capture the correct modality in the value distribution, while the extension of the conventional method with the stochastic policy fails to do so. Capturing the correct modality of value distributions can improve the performance of various policy-based RL applications that exploit statistics from the value distribution. Such applications may include training risk-sensitive policies and learning control tasks with sparse rewards that require heavy exploration, where transient information from the value distribution can benefit the learning process. We leave further development of these ideas as future work.

Appendices

A DISCUSSION ON THE CHOICE OF PROXIMAL POLICY OPTIMIZATION AS A BASELINE

A general learning process of RL can be described using policy iteration, which consists of two iterative phases: policy evaluation and policy improvement (Sutton & Barto, 1998). In policy iteration, the value function is assumed to be exact, meaning that, given a policy, the value function is learned until convergence over the entire state space, which results in a strong bound on the rate of convergence to the optimal value and policy (Puterman, 1994). But the exact value method is often infeasible due to resource limitations, since it requires multiple sweeps over the entire state space. Therefore, in practice, the value function is approximated, i.e., it is trained neither until convergence nor across the entire state space on each iteration. The approximate version of the exact value function method, also known as asynchronous value iteration, still converges to the unique optimal solution of the Bellman optimality operator. However, Bellman optimality only describes the limit of convergence, and thus the best we can practically do is measure the improvement on each update step. Bertsekas & Tsitsiklis (1996) have shown that, when we approximate the value function $V^\pi$ of some policy $\pi$ with $\tilde{V}$, the lower bound for a greedy policy $\pi'$ is given by

$$V^{\pi'}(x) \geq V^\pi(x) - \frac{2\gamma\varepsilon}{1 - \gamma}, \tag{21}$$

where $\varepsilon = \max_x |\tilde{V}(x) - V^\pi(x)|$ is the $L_\infty$ error of the value approximation $\tilde{V}$. This means a greedy policy from an approximate value function guarantees that its exact value function will not degrade by more than $\frac{2\gamma\varepsilon}{1-\gamma}$. However, there is no guarantee of improvement, i.e., $V^{\pi'}(x) > V^\pi(x)$ (Kakade & Langford, 2002).
As a solution to this issue, Kakade & Langford (2002) proposed a policy updating scheme named conservative policy iteration,

$$\pi_{\text{new}}(a|x) = (1 - \alpha)\pi_{\text{old}}(a|x) + \alpha\pi'(a|x), \tag{22}$$

which has an explicit lower bound on the improvement,

$$\eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2\epsilon\gamma}{(1 - \gamma)^2}\alpha^2, \tag{23}$$

where $\epsilon = \max_x |\mathbb{E}_{a \sim \pi'}[A_\pi(x, a)]|$, $A_\pi(x, a) = Q(x, a) - V(x)$ is the advantage function, $\eta(\pi)$ denotes the expected sum of rewards under the policy $\pi$, $\eta(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)\right]$, and $L_{\pi_{\text{old}}}$ is the local approximation of $\eta$ with the state visitation frequency under the old policy. From the definition of the distributional Bellman optimality operator in equation 5, one can see that the lower bound in equation 23 also holds when $\pi'$ is greedy with respect to the expectation of the value distribution, i.e., $\mathbb{E}_{x' \sim P}[Z(x', a')]$. Thus the improvement of the distributional Bellman update is guaranteed in expectation under conservative policy iteration, and the value functions are guaranteed to converge in distribution to a fixed point by γ-contraction. Schulman et al. (2015) take this further with an algorithm called trust region policy optimization (TRPO), which extends conservative policy iteration to general stochastic policies by replacing $\alpha$ with the Kullback-Leibler (KL) divergence between the two policies, $D^{\max}_{\text{KL}}(\pi, \tilde{\pi}) = \max_x D_{\text{KL}}(\pi(\cdot|x) \,\|\, \tilde{\pi}(\cdot|x))$. The newly formed objective is then to maximize the following, which is a form of constrained optimization with penalty:

$$\hat{\mathbb{E}}_t\left[\frac{\tilde{\pi}(a_t|x_t)}{\pi(a_t|x_t)}\hat{A}_t - \beta D_{\text{KL}}\big(\pi(\cdot|x_t), \tilde{\pi}(\cdot|x_t)\big)\right] = \hat{\mathbb{E}}_t\left[r_t(\tilde{\pi})\hat{A}_t - \beta D_{\text{KL}}\big(\pi(\cdot|x_t), \tilde{\pi}(\cdot|x_t)\big)\right], \tag{24}$$

where $r_t(\tilde{\pi})$ refers to the ratio $r_t(\tilde{\pi}) = \frac{\tilde{\pi}(a_t|x_t)}{\pi(a_t|x_t)}$. However, in practice, choosing a fixed penalty coefficient $\beta$ is difficult, and thus Schulman et al. (2015) instead impose a hard constraint on the KL divergence.

B EXPECTATION OF THE λ-RETURN DISTRIBUTION

Recall the definition of the λ-return distribution in terms of cdfs,

$$F_{Z^{(\lambda)}_t} = (1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} F_{Z^{(n)}_t}.$$
If we assume that the support of $Z^{(\lambda)}_t$ is defined on the extended real line $[-\infty, \infty]$, then

$$\mathbb{E}[Z^{(\lambda)}_t] = \int_0^{\infty} \left(1 - F_{Z^{(\lambda)}_t}\right) dz - \int_{-\infty}^0 F_{Z^{(\lambda)}_t}\, dz \tag{31}$$

$$= \int_0^{\infty} \left(1 - (1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} F_{Z^{(n)}_t}\right) dz - \int_{-\infty}^0 (1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} F_{Z^{(n)}_t}\, dz \tag{32}$$

$$= (1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \left(\int_0^{\infty} \left(1 - F_{Z^{(n)}_t}\right) dz - \int_{-\infty}^0 F_{Z^{(n)}_t}\, dz\right) \tag{33}$$

$$= (1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t = G^{(\lambda)}_t,$$

where equation 33 uses the fact that $(1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} = 1$. Thus we arrive at the desired expression, $\mathbb{E}[Z^{(\lambda)}_t] = G^{(\lambda)}_t$.

C DISTRIBUTIONAL BELLMAN OPERATOR AS A CONTRACTION IN CRAM ÉR METRIC SPACE

The Cramér distance possesses the following characteristics (a detailed derivation of each can be found in Bellemare et al. (2017b)):

$$\ell_p(A + X, A + Y) \leq \ell_p(X, Y), \qquad \ell_p(cX, cY) \leq |c|^{1/p}\, \ell_p(X, Y). \tag{35}$$

Using the above characteristics, applying the Bellman operator inside the $\ell_p$ divergence gives

$$\ell_p\big(\mathcal{T}^\pi Z_1(x, a), \mathcal{T}^\pi Z_2(x, a)\big) = \ell_p\big(R(x, a) + \gamma P^\pi Z_1(x, a),\; R(x, a) + \gamma P^\pi Z_2(x, a)\big)$$
$$\leq |\gamma|^{1/p}\, \ell_p\big(P^\pi Z_1(x, a), P^\pi Z_2(x, a)\big) \leq |\gamma|^{1/p} \sup_{x', a'} \ell_p\big(Z_1(x', a'), Z_2(x', a')\big).$$

Substituting this result into the definition of the maximal form of the Cramér distance yields

$$\bar{\ell}_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) = \sup_{x, a} \ell_p\big(\mathcal{T}^\pi Z_1(x, a), \mathcal{T}^\pi Z_2(x, a)\big) \leq |\gamma|^{1/p} \sup_{x', a'} \ell_p\big(Z_1(x', a'), Z_2(x', a')\big) = |\gamma|^{1/p}\, \bar{\ell}_p(Z_1, Z_2).$$

Thus the distributional Bellman operator is a $|\gamma|^{1/p}$-contraction mapping in the Cramér metric space, which was also proven in Rowland et al. (2019). Characteristics similar to equation 35 can be derived for the energy distance,

$$\mathcal{E}(A + X, A + Y) \leq \mathcal{E}(X, Y), \qquad \mathcal{E}(cX, cY) = |c|\, \mathcal{E}(X, Y),$$

showing that the distributional Bellman operator is a γ-contraction in the energy distance:

$$\mathcal{E}(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \leq \gamma\, \mathcal{E}(Z_1, Z_2).$$

Under review as a conference paper at ICLR 2021
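A quick numerical sanity check of the γ-scaling (our own illustration, using the sample-based energy distance estimator): applying the same deterministic reward shift and discount to two empirical distributions scales their energy distance by exactly γ, since every pairwise absolute difference scales by γ.

```python
import numpy as np

def energy(u, v):
    """Sample-based energy distance between univariate empirical distributions."""
    d = lambda a, b: np.abs(np.subtract.outer(a, b)).mean()
    return 2.0 * d(u, v) - d(u, u) - d(v, v)

# A deterministic Bellman-style update r + gamma * X applied to both
# sample sets contracts their energy distance by the factor gamma.
```

The same scaling holds in expectation for stochastic rewards and transitions, which is the content of the contraction bound above.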

D LOSS FUNCTIONS

As in other policy gradient methods, our value distribution approximator models the distribution of the value $V(x_t)$, not the state-action value $Q(x_t, a_t)$; we denote it as $Z_\theta(x_t)$, parameterized with $\theta$, whose cumulative distribution function is defined as

$$F_{Z_\theta(x_t)} = \sum_{a \in \mathcal{A}} \pi(a|x_t)\, F_{Z(x_t, a)}.$$

Below, we provide the complete value distribution loss function for each of the cases used in the experiments (Section 5).

D.1 IMPLICIT QUANTILE + HUBER QUANTILE (IQAC)

For the value loss of IQAC, we follow the general form of the Huber quantile loss described in Dabney et al. (2018b). For two random samples $\tau, \tau' \sim U([0, 1])$,

$$\delta^{\tau, \tau'}_t = Z^{(\lambda)}_t(x_t, a_t; \tau') - Z_\theta(x_t; \tau),$$

where $Z^{(\lambda)}_t$ is generated via SR(λ) and $Z(x; \tau) = F^{-1}_Z(\tau)$ is a realization of $Z(X)$ given $X = x$ and $\tau$. Then the full value distribution loss is given by

$$L_{Z_\theta} = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho^\kappa_{\tau_i}\left(\delta^{\tau_i, \tau'_j}_t\right), \tag{42}$$

where $N$ and $N'$ are the numbers of samples of $\tau$ and $\tau'$, respectively, and $\rho$ is the Huber quantile loss

$$\rho^\kappa_\tau(\delta_{ij}) = \left|\tau - \mathbb{I}\{\delta_{ij} < 0\}\right| \frac{L_\kappa(\delta_{ij})}{\kappa}, \quad \text{with } L_\kappa(\delta_{ij}) = \begin{cases} \frac{1}{2}\delta_{ij}^2, & \text{if } |\delta_{ij}| \leq \kappa \\ \kappa\left(|\delta_{ij}| - \frac{1}{2}\kappa\right), & \text{otherwise.} \end{cases}$$

D.2 IMPLICIT QUANTILE + ENERGY DISTANCE (IQAC-E)

Here, we replace the Huber quantile loss in equation 42 with the sample-based approximation of the energy distance defined in equation 19:

$$L_{Z_\theta} = \frac{2}{NN'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \left|\delta^{\tau_i, \tau'_j}_t\right| - \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \left|\delta^{\tau_i, \tau_{i'}}_t\right| - \frac{1}{N'^2} \sum_{j=1}^{N'} \sum_{j'=1}^{N'} \left|\delta^{\tau'_j, \tau'_{j'}}_t\right|.$$
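The Huber quantile loss of D.1 can be sketched in a few lines of NumPy. This is our own illustration of the loss form (the function name and array conventions are ours; in IQAC the predictions and targets would come from the implicit quantile network and SR(λ), respectively).

```python
import numpy as np

def huber_quantile_loss(pred_q, target_samples, taus, kappa=1.0):
    """Sketch of the Huber quantile regression loss in equation 42.

    pred_q:         predicted quantile values at fractions `taus`, shape (N,)
    target_samples: samples from the target distribution, shape (N',)
    """
    pred_q = np.asarray(pred_q, float)
    target_samples = np.asarray(target_samples, float)
    taus = np.asarray(taus, float)
    delta = target_samples[None, :] - pred_q[:, None]   # delta_ij, shape (N, N')
    abs_d = np.abs(delta)
    huber = np.where(abs_d <= kappa,                    # L_kappa(delta_ij)
                     0.5 * delta ** 2,
                     kappa * (abs_d - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (delta < 0.0))      # |tau - 1{delta < 0}|
    return float((weight * huber / kappa).mean(axis=1).sum())
```

As expected for a quantile regression loss, predictions near the target quantiles score lower than far-off predictions.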

D.3 GAUSSIAN MIXTURE + ENERGY DISTANCE (GMAC)

Unlike the two previous losses, which use samples at $\tau$ generated by the implicit quantile network $Z_\theta(x_t; \tau)$, here we discuss the case in which the distribution is a $k$-component Gaussian mixture parameterized by $(\mu_k, \sigma^2_k, w_k)$. Using the expectation of a folded normal distribution, we define $\delta$ between two Gaussian distributions as

$\delta(\mu_i, \sigma^2_i, \mu_j, \sigma^2_j) = \sqrt{\frac{2}{\pi}}\sqrt{\sigma^2_i + \sigma^2_j}\, \exp\left(-\frac{(\mu_i - \mu_j)^2}{2(\sigma^2_i + \sigma^2_j)}\right) + (\mu_i - \mu_j)\left(1 - 2\Phi\left(-\frac{\mu_i - \mu_j}{\sqrt{\sigma^2_i + \sigma^2_j}}\right)\right).$

Let $Z_\theta(x)$ and $Z^{(\lambda)}_t$ be Gaussian mixtures parameterized by $(\mu_{\theta i}, \sigma^2_{\theta i}, w_{\theta i})$ and $(\mu_{\lambda j}, \sigma^2_{\lambda j}, w_{\lambda j})$, respectively. Then the loss function for the value head is given by

$\mathcal{L}_{Z_\theta} = \frac{2}{NN'} \sum_{i=1}^{N} \sum_{j=1}^{N'} w_{\theta i} w_{\lambda j}\, \delta(\mu_{\theta i}, \sigma^2_{\theta i}, \mu_{\lambda j}, \sigma^2_{\lambda j}) - \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} w_{\theta i} w_{\theta i'}\, \delta(\mu_{\theta i}, \sigma^2_{\theta i}, \mu_{\theta i'}, \sigma^2_{\theta i'}) - \frac{1}{N'^2} \sum_{j=1}^{N'} \sum_{j'=1}^{N'} w_{\lambda j} w_{\lambda j'}\, \delta(\mu_{\lambda j}, \sigma^2_{\lambda j}, \mu_{\lambda j'}, \sigma^2_{\lambda j'}).$
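The closed form above is simply the mean of a folded normal: for independent $X_i \sim N(\mu_i, \sigma^2_i)$ and $X_j \sim N(\mu_j, \sigma^2_j)$, the difference is $N(\mu_i - \mu_j, \sigma^2_i + \sigma^2_j)$ and $\delta = \mathbb{E}|X_i - X_j|$. A small sketch with a Monte Carlo check (illustrative code, not the authors'):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def delta(mu_i, var_i, mu_j, var_j):
    """E|X_i - X_j| for independent Gaussians (mean of a folded normal)."""
    mu = mu_i - mu_j
    s2 = var_i + var_j
    s = sqrt(s2)
    return (s * sqrt(2.0 / pi) * exp(-mu * mu / (2.0 * s2))
            + mu * (1.0 - 2.0 * normal_cdf(-mu / s)))

# Monte Carlo sanity check against raw samples of |X_i - X_j|
rng = np.random.default_rng(0)
xi = rng.normal(1.0, 2.0, 200_000)
xj = rng.normal(-0.5, 1.0, 200_000)
mc = np.abs(xi - xj).mean()
print(delta(1.0, 4.0, -0.5, 1.0), mc)  # the two values should agree closely
```

This closed form is what lets GMAC evaluate the energy distance between Gaussian mixtures analytically, without drawing samples at training time.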

E PSEUDOCODE OF GMAC

Algorithm 2: GMAC
Input: initial policy parameters θ_0, initial value function parameters φ_0, length of trajectory T, number of environments E, clipping factor ε, discount factor γ, weight parameter λ
for k = 0, 1, 2, . . . do
    for e = 1, . . . , E do
        Collect samples of the discounted sum of rewards {Z_1, . . . , Z_T} by running policy π_k = π(θ_k) in the environment
        Compute the parameters (μ_i, σ_i, w_i) for each of the λ-returns {Z^(λ)_1, . . . , Z^(λ)_{T−1}} by SR(λ) (Algorithm 1)
        Compute advantage estimates Â_t using GAE (Schulman et al., 2016), based on the current value function V_{φ_k}
    end
    Gather the data from the E environments
    Update the policy using the clipped surrogate loss via stochastic gradient ascent:
        θ_{k+1} = argmax_θ E[ min( (π_θ(a_t|s_t) / π_{θ_k}(a_t|s_t)) Â_t , g(ε, Â_t) ) ]
    Update the value function using the energy distance between Gaussian mixtures (Equation 19) via stochastic gradient descent:
        φ_{k+1} = argmin_φ E[ E( V_φ(s_t), Z^(λ)_t ) ]
end

The clipping function g(ε, A) shown in the algorithm is defined as

g(ε, A) = (1 + ε)A if A ≥ 0, and (1 − ε)A if A < 0.

Note that the expectation of each loss is taken over the collection of trajectories and environments.
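The policy update in Algorithm 2 uses the standard PPO clipped surrogate; a minimal pointwise sketch of g and the resulting objective (illustrative, not the authors' implementation):

```python
def g(eps, advantage):
    """Clipping branch g(eps, A) from Algorithm 2."""
    return (1.0 + eps) * advantage if advantage >= 0 else (1.0 - eps) * advantage

def clipped_surrogate(ratio, advantage, eps=0.2):
    """min(r_t * A_t, g(eps, A_t)); equivalent to PPO's
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * advantage, g(eps, advantage))
```

For example, with a positive advantage the objective saturates once the probability ratio exceeds 1 + ε, so the policy gains nothing from moving further in that direction.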

G MORE EXPERIMENTAL RESULTS

G.1 FIVE-STATE MDP

Here we provide more details on the five-state MDP presented in Figure 2. For each case in the figure, 15 Diracs are used for the quantile-based methods and 5 mixture components are used for the GMM, to balance the total number of parameters required to represent a distribution. For the cases labeled "naïve", the network outputs (quantiles, expectiles, etc.) are used directly to create the plot. In contrast, the cases labeled "imputation" apply an appropriate imputation strategy to the statistics to produce samples, which are then used to plot the distribution. In all cases, the network is trained on full batches of all states for 8 different empirical samples with the Adam optimizer and a learning rate of 1e-3. The sample-based energy distance was used to calculate the distance from the true distribution for all cases. The tabular setting of the five-state MDP in Figure 5 uses a hard constraint instead of the penalty.



Figure 1: Modality of value distribution during the learning process of Breakout-v4. (a) An arrow is added in the inset to indicate the ball's direction of travel. The episode reaches a terminal state if the paddle misses the ball. (b) Probability density functions of the value distributions learned by each actor-critic when {0.2, 0.4, 1.2}M frames are collected. As the policy improves, the probability of losing a turn (V = 0) decreases, and the probability of earning scores (V > 0) increases. Note that the modality transition from V = 0 is clearly captured by the GMM + Cramér method.

Figure 2: (a) An environment with two states and stochastic rewards with expected value of zero. (b) The probability density function (above) and the Cramér distance between the ground truth and the estimated distributions for IQ, IQE, and GM (below). (c) Learning curve of Montezuma's Revenge using modality information as intrinsic reward.

Figure 3: Learning curves for Atari games from ALE and 3 continuous control tasks from PyBullet.

Figure 4: In addition to Figure 2, quantile and expectile regressions are also evaluated in the five-state MDP with imputation, using the neural network architecture of the implicit quantile network, against IQE and the Gaussian mixture with their respective loss functions.

Figure 6: More value distributions for different tasks. All states are chosen such that the agent is near death or near a positive score. Thus, when the policy is not fully trained, such as at a very early stage, the value distribution should include a notion of death, indicated by a mode positioned at zero. In all games, IQN + Huber quantile (IQAC) fails to correctly capture a mode positioned at zero, while the other two methods, IQN + energy distance (IQAC-E) and GMM + energy distance (GMAC), capture the mode in the early stage of policy improvement. Again, the visual representation is a max-pool over the 4 stacked frames in the given state.

Figure 7: Raw learning curves for 8 selected Atari games.

Figure 8: Raw learning curves for 5 selected PyBullet continuous control tasks.

Figure 9: Full learning curves of 61 Atari games from ALE.

FLOP measurement results for a single process in Breakout-v4


Average score over the last 100 episodes at 200M frames collected for training on 61 Atari games. The algorithms are trained using the same single seed and hyperparameters. Random and Human scores are taken from Wang et al.

F IMPLEMENTATION DETAILS

For producing a categorical distribution, a softmax layer is added to the output of the network. For producing a Gaussian mixture distribution, the mean of each Gaussian is simply the output of the network, the variance is kept positive by passing the output through a softplus layer, and the weight of each Gaussian is produced through the softmax layer. Since our proposed method uses an architecture that changes only the value head of the original PPO network, we base our hyperparameter settings on the original paper (Schulman et al., 2017). We performed a hyperparameter search on a subset of variables: optimizers = {Adam, RMSprop}, learning rate = {2.5e-4, 1.0e-4}, number of epochs = {4, 10}, batch size = {256, 512}, and number of environments = {16, 32, 64, 128}, over the three Atari tasks Breakout, Gravitar, and Seaquest, for which there was no degradation in the performance of PPO.
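The Gaussian mixture head described above can be sketched in numpy as follows (the 3k-way split of the raw network outputs and all names are assumptions for illustration, not the authors' code):

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + e^x); keeps variances strictly positive."""
    return np.logaddexp(0.0, x)

def softmax(x):
    """Mixture weights: positive and summing to one."""
    z = np.exp(x - x.max())
    return z / z.sum()

def mixture_params(raw):
    """Map 3k raw network outputs to a k-component Gaussian mixture:
    means unconstrained, variances through softplus, weights through softmax."""
    k = raw.shape[0] // 3
    mu = raw[:k]
    var = softplus(raw[k:2 * k])
    w = softmax(raw[2 * k:])
    return mu, var, w

# Example: 9 raw outputs -> a 3-component mixture
mu, var, w = mixture_params(np.array([0.5, -1.0, 2.0, 0.0, -3.0, 1.0, 1.0, 0.0, -1.0]))
```

The same transforms are the conventional choices in mixture density networks, which is why only the value head of the PPO architecture needs to change.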

