HOW DOES VALUE DISTRIBUTION IN DISTRIBUTIONAL REINFORCEMENT LEARNING HELP OPTIMIZATION?

Abstract

We consider the problem of learning a set of probability distributions from the Bellman dynamics in distributional reinforcement learning (RL), which learns the whole return distribution rather than only its expectation as in classical RL. Despite its success in achieving superior performance, we still have a poor understanding of how the value distribution in distributional RL works. In this study, we analyze the optimization benefits of distributional RL by leveraging its additional value distribution information over classical RL within the Neural Fitted Z-Iteration (Neural FZI) framework. To begin with, we demonstrate that the distribution loss of distributional RL has desirable smoothness characteristics and hence enjoys stable gradients, which is in line with its tendency to promote optimization stability. Furthermore, the acceleration effect of distributional RL is revealed by decomposing the return distribution. It turns out that distributional RL performs favorably if the value distribution is approximated appropriately, as measured by the variance of gradient estimates in each environment for any specific distributional RL algorithm. Rigorous experiments validate the stable optimization behaviors of distributional RL, which contribute to its acceleration effects over classical RL. Our findings illuminate how the value distribution in distributional RL algorithms helps the optimization.

1. INTRODUCTION

Distributional reinforcement learning (Bellemare et al., 2017a; Dabney et al., 2018b;a; Yang et al., 2019; Zhou et al., 2020; Nguyen et al., 2020; Luo et al., 2021; Sun et al., 2022) characterizes the intrinsic randomness of returns within the framework of Reinforcement Learning (RL). When the agent interacts with the environment, the intrinsic uncertainty of the environment manifests itself in the stochasticity of the rewards the agent receives and in the inherently chaotic state and action dynamics of physical interaction, increasing the difficulty of RL algorithm design. Distributional RL aims to represent the entire distribution of returns in order to capture more of the intrinsic uncertainty of the environment, and thus to use these value distributions to evaluate and optimize the policy. This is in stark contrast to classical RL, which focuses only on the expectation of the return distribution, as in temporal-difference (TD) learning (Sutton & Barto, 2018) and Q-learning (Watkins & Dayan, 1992). As a promising branch of RL algorithms, distributional RL has demonstrated state-of-the-art performance in a wide range of environments, e.g., Atari games, in which the representation of return distributions and the distribution divergence between the current and target return distributions within each Bellman update are pivotal to its empirical success (Dabney et al., 2018a; Sun et al., 2021b; 2022).

Specifically, categorical distributional RL, e.g., C51 (Bellemare et al., 2017a; Rowland et al., 2018), approximates the density probabilities in pre-specified bins over a bounded range and minimizes a Kullback-Leibler (KL) divergence, serving as the first successful distributional RL family in recent years. Quantile regression (QR) distributional RL, e.g., QR-DQN (Dabney et al., 2018b), approximates the Wasserstein distance via the quantile regression loss and leverages quantiles to represent the whole return distribution.
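To make the quantile-regression family concrete, the following minimal sketch (our own illustrative code, not from any of the cited implementations) computes the pinball loss that QR-DQN-style updates minimize between current quantile estimates and samples from the target return distribution:

```python
import numpy as np

def quantile_regression_loss(theta, target_samples, taus):
    """Pinball (quantile regression) loss used in QR-DQN-style updates.

    theta:          (N,) current quantile estimates at fractions taus
    target_samples: (M,) samples from the target return distribution
    taus:           (N,) quantile fractions, e.g. midpoints (2i + 1) / (2N)
    """
    # Pairwise TD-style errors: each target sample minus each current quantile.
    u = target_samples[None, :] - theta[:, None]                      # (N, M)
    # Asymmetric weighting: tau for underestimates, (tau - 1) otherwise.
    loss = np.where(u > 0, taus[:, None] * u, (taus[:, None] - 1.0) * u)
    return loss.mean()

# Toy usage: fit 5 quantiles of a standard normal target.
taus = (2 * np.arange(5) + 1) / 10.0
theta = np.zeros(5)
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
print(quantile_regression_loss(theta, samples, taus))
```

In practice QR-DQN smooths this loss with a Huber variant, but the asymmetric weighting above is what makes the minimizer the tau-th quantile rather than the mean.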
Other variants of QR-DQN, including Implicit Quantile Networks (IQN) (Dabney et al., 2018a) and the Fully parameterized Quantile Function (FQF) (Yang et al., 2019), achieve significantly better performance across a wide range of Atari games. Moment-matching distributional RL (Nguyen et al., 2020) learns deterministic samples to evaluate the distribution distance based on the Maximum Mean Discrepancy, while the more recent Sinkhorn distributional RL (Sun et al., 2022) interpolates between the Maximum Mean Discrepancy and the Wasserstein distance via the Sinkhorn divergence (Sinkhorn, 1967). Meanwhile, distributional RL also offers benefits in risk-sensitive control (Dabney et al., 2018a), policy exploration (Mavrin et al., 2019; Rowland et al., 2019) and robustness (Sun et al., 2021a).

Despite the remarkable empirical success of distributional RL, its theoretical advantages remain understudied. A distributional regularization effect (Sun et al., 2021b) stemming from the additional value distribution knowledge has been characterized to explain the superiority of distributional RL over classical RL, but the benefit of the proposed regularization on the optimization of algorithms has not been investigated, even though optimization plays a key role in RL algorithms. Among strategies known to help learning in RL, recent progress has focused mainly on policy gradient methods (Sutton & Barto, 2018). Mei et al. (2020) show that policy gradient with a softmax parameterization converges at a O(1/t) rate, with constants depending on the problem and initialization, which significantly expands the existing asymptotic convergence results. Entropy regularization (Haarnoja et al., 2017; 2018) has gained increasing attention as it can significantly speed up policy optimization with a faster linear convergence rate (Mei et al., 2020). Ahmed et al. (2019) provide a fine-grained understanding of the impact of entropy on policy optimization, and emphasize that any strategy, such as entropy regularization, can affect learning in only one of two ways: either it reduces the noise in the gradient estimates, or it changes the optimization landscape. These commonly used strategies that accelerate RL learning inspire us to further investigate the optimization impact of distributional RL arising from the exploitation of return distributions.

In this paper, we study the theoretical superiority of distributional RL over classical RL from the optimization standpoint. We begin by analyzing the optimization impact of different strategies within the Neural Fitted Z-Iteration (Neural FZI) framework and point out two crucial factors that contribute to the optimization of distributional RL: the distribution divergence and the distribution parameterization error. We also reveal the smoothness property of the distributional RL loss function under the categorical parameterization, which yields stable optimization behavior; uniform stability in the optimization process is thus more easily achieved for distributional RL than for classical RL. In addition to optimization stability, we elaborate on the acceleration effect of distributional RL algorithms based on a recently proposed value distribution decomposition technique. It turns out that distributional RL speeds up convergence and performs favorably if the value distribution is approximated appropriately, as measured by the variance of gradient estimates. Empirical results corroborate that distributional RL indeed enjoys stable gradient behavior, exhibiting smaller gradient norms with respect to the observations the agent encounters during learning.
Moreover, the variance reduction of gradient estimates with respect to network parameters for distributional RL algorithms provides strong evidence for the smoothness property and acceleration effects of distributional RL. Our study opens up many exciting research pathways in this domain through the lens of optimization, paving the way for future investigations to reveal more advantages of distributional RL.
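The variance of gradient estimates mentioned above can be measured empirically. A minimal sketch of such a measurement (our own helper, not code from the paper; the linear-model squared-error gradient is only a stand-in for an RL loss) is:

```python
import numpy as np

def gradient_variance(grad_fn, params, batches):
    """Mean coordinate-wise variance of minibatch gradient estimates,
    a simple proxy for the gradient noise discussed in the text."""
    grads = np.stack([grad_fn(params, b) for b in batches])  # (B, d)
    return grads.var(axis=0).mean()

# Stand-in loss: squared error of a linear model; gradient w.r.t. weights w.
def mse_grad(w, batch):
    x, y = batch
    return 2 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
w = np.zeros(3)
batches = []
for _ in range(10):
    x = rng.normal(size=(32, 3))
    y = x @ np.ones(3) + rng.normal(scale=0.1, size=32)
    batches.append((x, y))
print(gradient_variance(mse_grad, w, batches))
```

Tracking this quantity across training, for both a distributional loss and its expectation-based counterpart, is one way to compare their gradient noise in a given environment.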

2. PRELIMINARY KNOWLEDGE

Classical RL. In a standard RL setting, the interaction between an agent and the environment is modeled as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $P$ is the transition kernel dynamics, $R$ is the reward measure, and $\gamma \in (0, 1)$ is the discount factor. For a fixed policy $\pi$, the return $Z^\pi = \sum_{t=0}^{\infty} \gamma^t R_t$ is a random variable representing the sum of discounted rewards observed along one trajectory of states while following the policy $\pi$. Classical RL focuses on the value function and the action-value function, i.e., the expectation of the return $Z^\pi$. The action-value function $Q^\pi(s, a)$ is defined as $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$, where $s_0 = s$, $a_0 = a$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and $a_t \sim \pi(\cdot \mid s_t)$. Distributional RL. Distributional RL, on the other hand, focuses on the action-value distribution, i.e., the full distribution of $Z^\pi(s, a)$ rather than only its expectation $Q^\pi(s, a)$. Leveraging knowledge of the entire value distribution can better capture the uncertainty of returns and is thus advantageous for exploring the intrinsic uncertainty of the environment (Dabney et al., 2018a; Mavrin et al., 2019). The scalar-based classical Bellman update is therefore extended to distributional
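The definitions above can be made concrete with a short sketch (illustrative only; the reward stream is a toy stand-in for one state-action pair under a fixed policy, not an environment from the paper). It samples trajectories, computes the random return Z^π, and contrasts the classical scalar target Q^π(s, a) = E[Z^π(s, a)] with a distributional summary:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Z = sum_t gamma^t * R_t along one sampled trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Toy reward stream standing in for one (s, a) pair under a fixed policy.
rng = np.random.default_rng(0)
returns = np.array([
    discounted_return(rng.normal(loc=1.0, scale=0.5, size=20), gamma=0.9)
    for _ in range(1000)
])

q_value = returns.mean()                           # classical RL: Q(s, a) = E[Z(s, a)]
quantiles = np.quantile(returns, [0.1, 0.5, 0.9])  # distributional RL keeps the spread
```

Classical RL collapses the `returns` array to the single scalar `q_value`; distributional RL retains a representation of the whole sample distribution, of which the three quantiles above are a crude summary.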

