INTERPRETING DISTRIBUTIONAL REINFORCEMENT LEARNING: A REGULARIZATION PERSPECTIVE

Abstract

Distributional reinforcement learning (RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than its expected value alone. Despite the remarkable performance of distributional RL, its theoretical advantages over expectation-based RL remain elusive. Our work attributes the potential superiority of distributional RL to a regularization effect that stems from exploiting the full value distribution rather than only its expectation. We decompose the value distribution into its expectation and the remaining distribution part using a variant of the gross error model from robust statistics. Distributional RL thus enjoys an additional benefit over expectation-based RL through the impact of a risk-sensitive entropy regularization within the Neural Fitted Z-Iteration framework. Meanwhile, we investigate the role of the resulting regularization in actor-critic algorithms by bridging the risk-sensitive entropy regularization of distributional RL with the vanilla entropy in maximum entropy RL. This reveals that distributional RL induces an augmented reward function, which promotes risk-sensitive exploration against the intrinsic uncertainty of the environment. Finally, extensive experiments verify the importance of the regularization effect in distributional RL, as well as the mutual impacts of the different entropy regularizations. Our study paves the way towards a better understanding of distributional RL, especially through a regularization lens.

1. INTRODUCTION

The intrinsic characteristics of classical reinforcement learning (RL) algorithms, such as temporal-difference (TD) learning (Sutton & Barto, 2018) and Q-learning (Watkins & Dayan, 1992), are based on the expectation of the discounted cumulative rewards that an agent observes while interacting with the environment. In stark contrast to classical expectation-based RL, a new branch of algorithms called distributional RL estimates the full distribution of total returns and has demonstrated state-of-the-art performance in a wide range of environments (Bellemare et al., 2017a; Dabney et al., 2018b;a; Yang et al., 2019; Zhou et al., 2020; Nguyen et al., 2020; Sun et al., 2022). Distributional RL also brings further benefits in risk-sensitive control (Dabney et al., 2018a), policy exploration (Mavrin et al., 2019; Rowland et al., 2019) and robustness (Sun et al., 2021). Despite the numerous algorithmic variants of distributional RL with remarkable empirical success, we still have a poor understanding of where the effectiveness of distributional RL stems from, and theoretical studies on its advantages over expectation-based RL remain less established. The distributional RL problem has also been cast as a Wasserstein gradient flow (Martin et al., 2020), treating the distributional Bellman residual as a potential energy functional. Offline distributional RL (Ma et al., 2021) has been proposed to investigate the efficacy of distributional RL in both risk-neutral and risk-averse domains. Lyle et al. (2019) proved that in many tabular and linear-approximation settings, distributional RL behaves identically to expectation-based RL under the coupled updates method, but can diverge from it under non-linear approximation. Although these works do not yet provide a complete explanation, they mark an encouraging trend towards closing the gap between theory and practice in distributional RL.
In this paper, we illuminate the behavior difference of distributional RL over expectation-based RL through the lens of regularization, in order to explain its empirical outperformance in most practical environments. Specifically, we simplify distributional RL into a Neural Fitted Z-Iteration framework, within which we establish an equivalence of objective functions between distributional RL and a risk-sensitive entropy-regularized Neural Fitted Q-Iteration from a statistical perspective. This result rests on two analytical components: an action-value density function decomposition that leverages a variant of the gross error model from robust statistics, and the Kullback-Leibler (KL) divergence used to measure the distance between the current and target value distributions in each Bellman update. We then establish a connection between the risk-sensitive entropy regularization of distributional RL and the vanilla entropy in maximum entropy RL, yielding a Distribution-Entropy-Regularized Actor-Critic algorithm. Empirical results demonstrate the crucial role of the risk-sensitive entropy regularization effect in the potential superiority of distributional RL over expectation-based RL on both Atari games and MuJoCo environments. We also reveal the mutual impacts of the risk-sensitive entropy in distributional RL and the vanilla entropy in maximum entropy RL, suggesting further research directions.

2. PRELIMINARY KNOWLEDGE

In classical RL, an agent interacts with an environment via a Markov decision process (MDP), a 5-tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, $P$ is the environment transition dynamics, $R$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor.

Action-value Function vs Action-value Distribution. Given a policy $\pi$, the discounted sum of future rewards is a random variable $Z^\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$, where $s_0 = s$, $a_0 = a$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and $a_t \sim \pi(\cdot \mid s_t)$. In the control setting, expectation-based RL focuses on the action-value function $Q^\pi(s, a)$, the expectation of $Z^\pi(s, a)$, i.e., $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]$. Distributional RL, on the other hand, focuses on the action-value distribution, the full distribution of $Z^\pi(s, a)$. The density function of the action-value distribution, if it exists, is called the action-value density function.

Bellman Operators vs Distributional Bellman Operators. For policy evaluation in expectation-based RL, the value function is updated via the Bellman operator $\mathcal{T}^\pi Q(s, a) = \mathbb{E}[R(s, a)] + \gamma \mathbb{E}_{s' \sim P, a' \sim \pi}[Q(s', a')]$. We also define the Bellman optimality operator $\mathcal{T}^{\mathrm{opt}} Q(s, a) = \mathbb{E}[R(s, a)] + \gamma \max_{a'} \mathbb{E}_{s' \sim P}[Q(s', a')]$. In distributional RL, the action-value distribution of $Z^\pi(s, a)$ is updated via the distributional Bellman operator $\mathfrak{T}^\pi$, i.e., $\mathfrak{T}^\pi Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(s', a')$, where $s' \sim P(\cdot \mid s, a)$ and $a' \sim \pi(\cdot \mid s')$. The equality indicates that the random variables on both sides are equal in distribution. This random-variable definition of the distributional Bellman operator is appealing and easy to grasp owing to its concise form, although the value-distribution definition is more mathematically rigorous (Rowland et al., 2018; Bellemare et al., 2022).

Categorical Distributional RL. Categorical distributional RL (Bellemare et al., 2017a) can be viewed as the first successful family of distributional RL algorithms; it approximates the value distribution $\eta$ by a discrete categorical distribution $\eta = \sum_{i=1}^{N} p_i \delta_{z_i}$, where $z_1 \le z_2 \le \dots \le z_N$ is a set of fixed supports (atoms) and $\{p_i\}_{i=1}^{N}$ are learnable probabilities. Leveraging a heuristic projection operator $\Pi_C$ (see Appendix A for more details) together with the KL divergence allows the theoretical convergence of categorical distributional RL under the Cramér distance (Rowland et al., 2018).
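For concreteness, the categorical (C51-style) distributional Bellman update, including the heuristic projection $\Pi_C$ onto the fixed atoms, can be sketched as follows. This NumPy sketch is illustrative rather than the authors' implementation; the function name, argument shapes, and uniform-atom assumption are our own choices:

```python
import numpy as np

def categorical_projection(rewards, next_probs, z, gamma=0.99):
    """Project the target distribution R + gamma * Z(s', a') onto the fixed
    support z_1 <= ... <= z_N (the heuristic projection Pi_C).

    rewards:    (B,)   sampled rewards r(s, a)
    next_probs: (B, N) probabilities of Z(s', a') on the atoms z
    z:          (N,)   fixed, evenly spaced atom locations
    Returns (B, N) projected target probabilities.
    """
    B, N = next_probs.shape
    v_min, v_max = z[0], z[-1]
    dz = z[1] - z[0]
    # Distributional Bellman target atoms T z_j = r + gamma * z_j, clipped to the support.
    tz = np.clip(rewards[:, None] + gamma * z[None, :], v_min, v_max)
    b = (tz - v_min) / dz                       # fractional atom index of each target
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    proj = np.zeros((B, N))
    for i in range(B):
        for j in range(N):
            if lo[i, j] == hi[i, j]:
                # Target lands exactly on an atom: assign all of its mass there.
                proj[i, lo[i, j]] += next_probs[i, j]
            else:
                # Split the mass between the two neighbouring atoms,
                # proportionally to proximity.
                proj[i, lo[i, j]] += next_probs[i, j] * (hi[i, j] - b[i, j])
                proj[i, hi[i, j]] += next_probs[i, j] * (b[i, j] - lo[i, j])
    return proj
```

The inner loop linearly distributes each target atom's probability mass over its two nearest support atoms, which mirrors the projection whose convergence under the Cramér distance is analysed by Rowland et al. (2018).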

3.1. DISTRIBUTIONAL RL: NEURAL FITTED Z-ITERATION (NEURAL FZI)

Expectation-based RL: Neural Fitted Q-Iteration (Neural FQI). Neural FQI (Fan et al., 2020; Riedmiller, 2005) offers a statistical explanation of DQN (Mnih et al., 2015), capturing its key features, including experience replay and the target network $Q_{\theta^*}$. In Neural FQI, we update the parameterized $Q_\theta(s, a)$ in each iteration $k$ by solving a regression problem: $Q_\theta^{k+1} = \arg\min_{Q_\theta} \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - Q_\theta^k(s_i, a_i) \right]^2$, where the target $y_i = r(s_i, a_i) + \gamma \max_{a \in \mathcal{A}} Q_{\theta^*}^k(s_i', a)$ is held fixed for $T_{\mathrm{target}}$ steps, after which the target network $Q_{\theta^*}$ is updated by setting $\theta^* = \theta$. The experience buffer induces independent samples $\{(s_i, a_i, r_i, s_i')\}_{i \in [n]}$. In an ideal case when we neglect the non-convexity and TD approximation
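As a minimal illustration of the Neural FQI iteration (not the paper's implementation), one regression step can be sketched with a linear-in-features approximator standing in for the Q-network; the `features` map, `fqi_iteration`, and the synthetic batch are all hypothetical choices made for this sketch:

```python
import numpy as np

def features(s, a, n_actions=2):
    """Hypothetical feature map phi(s, a): one (bias, state) block per action."""
    phi = np.zeros(2 * n_actions)
    phi[2 * a:2 * a + 2] = [1.0, s]
    return phi

def fqi_iteration(theta_target, batch, gamma=0.99, n_actions=2):
    """One Neural FQI step on transitions (s, a, r, s').

    Targets y_i = r_i + gamma * max_a Q_{theta*}(s'_i, a) are computed with the
    frozen parameters theta_target, then Q_theta is fit by least squares,
    solving argmin_theta (1/n) * sum_i (y_i - phi(s_i, a_i) @ theta)^2.
    """
    X, y = [], []
    for s, a, r, s_next in batch:
        q_next = max(features(s_next, b) @ theta_target for b in range(n_actions))
        X.append(features(s, a))
        y.append(r + gamma * q_next)
    theta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return theta

# Synthetic batch of 64 transitions with scalar states and 2 actions.
rng = np.random.default_rng(0)
batch = [(rng.standard_normal(), int(rng.integers(2)),
          rng.standard_normal(), rng.standard_normal()) for _ in range(64)]
theta = fqi_iteration(np.zeros(4), batch)
```

Iterating this step, periodically copying the fitted `theta` into `theta_target`, reproduces the fixed-target structure of DQN described above.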

