NON-DECREASING QUANTILE FUNCTION NETWORK WITH EFFICIENT EXPLORATION FOR DISTRIBUTIONAL REINFORCEMENT LEARNING

Abstract

Although distributional reinforcement learning (DRL) has been widely studied in recent years, two questions remain open. One is how to ensure the validity of the learned quantile function; the other is how to efficiently utilize the distribution information. This paper offers new perspectives to encourage in-depth study of both problems. We first propose a non-decreasing quantile function network (NDQFN) to guarantee the monotonicity of the obtained quantile estimates, and then design a general exploration framework for DRL, called distributional prediction error (DPE), which utilizes the entire distribution of the quantile function. We discuss the theoretical necessity of our method and demonstrate the performance gain it achieves in practice by comparing it with several competitors on Atari 2600 games, especially hard-exploration games.

1. INTRODUCTION

Distributional reinforcement learning (DRL) algorithms (Jaquette, 1973; Sobel, 1982; White, 1988; Morimura et al., 2010; Bellemare et al., 2017), unlike value-based methods (Watkins, 1989; Mnih et al., 2013) that focus on the expectation of the return, characterize the cumulative reward as a random variable and attempt to approximate its whole distribution. Most existing DRL methods fall into two main classes according to how they model the return distribution. One class, including categorical DQN (C51; Bellemare et al., 2017) and Rainbow (Hessel et al., 2018), assumes that all possible returns are bounded in a known range and learns the probability of each value through interaction with the environment. The other class, for example QR-DQN (Dabney et al., 2018b), obtains quantile estimates at some fixed locations by minimizing the Huber loss (Huber, 1992) of quantile regression (Koenker, 2005). To parameterize the entire distribution more precisely, quantile-value-based methods such as IQN (Dabney et al., 2018a) and FQF (Yang et al., 2019) learn a continuous map from any quantile fraction τ ∈ (0, 1) to its estimate on the quantile curve.

The theoretical validity of QR-DQN, IQN, and FQF heavily depends on the prerequisite that the approximated quantile curve is non-decreasing. Unfortunately, since no global constraint is imposed when simultaneously estimating the quantile values at multiple locations, monotonicity cannot be ensured by any existing network design. At the early training stage, this crossing issue is even more severe given limited training samples. Another problem to be solved is how to design an efficient exploration method for DRL.
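As a reference point for the quantile-based methods above, the quantile regression Huber loss minimized by QR-DQN-style methods can be sketched as follows. This is a simplified NumPy version with our own variable names, not the paper's implementation:

```python
import numpy as np

def quantile_huber_loss(theta, z, taus, kappa=1.0):
    """Quantile regression Huber loss in the style of QR-DQN.

    theta: (N,) quantile estimates at fractions `taus`.
    z:     (M,) sampled target returns.
    kappa: Huber threshold.
    """
    # Pairwise TD errors u_ij = z_j - theta_i, shape (N, M).
    u = z[None, :] - theta[:, None]
    # Element-wise Huber loss: quadratic near zero, linear in the tails.
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric quantile weight |tau - 1{u < 0}| penalizes over- and
    # under-estimation differently at each fraction.
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    return float((weight * huber / kappa).mean())
```

Minimizing this loss in `theta` at fixed fractions drives each component toward the corresponding quantile of the target distribution; note that nothing in the objective itself forces the resulting estimates to be ordered.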
Most existing exploration techniques are originally designed for non-distributional RL (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017; Fox et al., 2018; Machado et al., 2018), and very few of them work for DRL. Mavrin et al. (2019) propose a DRL-based exploration method, DLTV, for QR-DQN that uses the left truncated variance as an exploration bonus. However, this approach cannot be directly applied to quantile-value-based algorithms: the original DLTV method requires all quantile locations to be fixed, whereas IQN and FQF resample the quantile locations at each training iteration, which would make the bonus term extremely unstable.

To address these two common issues in DRL studies, we propose a novel algorithm called Non-Decreasing Quantile Function Network (NDQFN), together with an efficient exploration strategy, distributional prediction error (DPE), designed for DRL. The NDQFN architecture approximates the quantile distribution of the return by a non-decreasing piecewise linear function: monotonicity at N + 1 fixed quantile levels is ensured by an incremental structure, and the quantile value at any τ ∈ (0, 1) is estimated as the weighted sum of its two nearest neighbors among the N + 1 locations. DPE uses the 1-Wasserstein distance between the quantile distributions estimated by the target network and the predictor network as an additional bonus when selecting the optimal action.

We describe the implementation details of NDQFN and DPE and examine their performance on Atari 2600 games. Comparing NDQFN and DPE with baseline methods such as IQN and DLTV, we show that the combination of the two consistently achieves the best performance, especially in hard-exploration games such as Venture, Montezuma's Revenge, and Private Eye. For the rest of this paper, we first review the background of distributional RL in Section 2.
Then in Section 3, we introduce NDQFN and DPE and describe their implementation details. Section 4 presents experimental results on the Atari benchmark, investigating the empirical performance of NDQFN, DPE, and their combination relative to the baseline methods.
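The two constructions just outlined, an incremental structure for a non-decreasing quantile curve and a 1-Wasserstein exploration bonus, can be sketched in NumPy as follows. The softplus parameterization and all function names here are our own illustrative choices under the stated assumptions, not necessarily the paper's exact implementation:

```python
import numpy as np

def monotone_quantiles(base, deltas):
    """Non-decreasing quantile values at N + 1 fixed fractions via an
    incremental structure: q_0 = base, q_i = q_{i-1} + positive increment.
    `base` and `deltas` stand in for unconstrained network outputs."""
    increments = np.log1p(np.exp(deltas))  # softplus, strictly positive
    return base + np.concatenate([[0.0], np.cumsum(increments)])

def interpolate(tau, fractions, q_values):
    """Quantile estimate at any tau in (0, 1) as the weighted sum of its
    two nearest neighbors among the fixed fractions (piecewise linear)."""
    return np.interp(tau, fractions, q_values)

def dpe_bonus(q_target, q_pred, fractions, grid_size=99):
    """DPE-style bonus: 1-Wasserstein distance between two piecewise
    linear quantile curves, approximated on a uniform grid of fractions."""
    taus = np.linspace(0.01, 0.99, grid_size)
    f_t = np.interp(taus, fractions, q_target)
    f_p = np.interp(taus, fractions, q_pred)
    return float(np.abs(f_t - f_p).mean())
```

Because the increments are positive by construction, the resulting quantile curve cannot cross itself regardless of the raw network outputs, and the bonus vanishes exactly when the target and predictor curves coincide.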

2. BACKGROUND AND RELATED WORK

Following the standard reinforcement learning setting, the agent-environment interactions are modeled as a Markov decision process (MDP) $(\mathcal{X}, \mathcal{A}, R, P, \gamma)$ (Puterman, 2014). $\mathcal{X}$ and $\mathcal{A}$ denote a finite set of states and a finite set of actions, respectively; $R : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function; $\gamma \in [0, 1)$ is a discount factor; and $P : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0, 1]$ is the transition kernel. On the agent side, a stochastic policy $\pi$ maps a state $x$ to a distribution over actions, $a \sim \pi(\cdot \mid x)$, regardless of the time step $t$. The discounted cumulative reward is denoted by the random variable $Z^\pi(x, a) = \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)$, where $x_0 = x$, $a_0 = a$, $x_t \sim P(\cdot \mid x_{t-1}, a_{t-1})$, and $a_t \sim \pi(\cdot \mid x_t)$. The objective is to find an optimal policy $\pi^*$ maximizing the expectation of $Z^\pi(x, a)$, $\mathbb{E} Z^\pi(x, a)$, which is denoted by the state-action value function $Q^\pi(x, a)$. A common way to obtain $\pi^*$ is to find the unique fixed point $Q^* = Q^{\pi^*}$ of the Bellman optimality operator $\mathcal{T}$ (Bellman, 1966):

$$Q^*(x, a) = \mathcal{T} Q^*(x, a) := \mathbb{E}[R(x, a)] + \gamma \, \mathbb{E}_P \max_{a'} Q^*(x', a'). \tag{1}$$

The state-action value function $Q$ can be approximated by a parameterized function $Q_\theta$ (e.g., a neural network). Q-learning (Watkins, 1989) iteratively updates the network parameters by minimizing the squared temporal-difference (TD) error

$$\delta_t^2 = \left( r_t + \gamma \max_{a' \in \mathcal{A}} Q_\theta(x_{t+1}, a') - Q_\theta(x_t, a_t) \right)^2$$

on a sampled transition $(x_t, a_t, r_t, x_{t+1})$ collected by running an $\epsilon$-greedy policy over $Q_\theta$. Distributional RL algorithms, instead of directly estimating the mean $\mathbb{E} Z^\pi(x, a)$, model the distribution of $Z^\pi(x, a)$ to sufficiently capture its intrinsic randomness. The distributional Bellman operator, which has a structure similar to (1), is defined as

$$\mathcal{T} Z^\pi(x, a) \overset{D}{=} R(x, a) + \gamma Z^\pi(X', A'),$$

where $X' \sim P(\cdot \mid x, a)$, $A' \sim \pi(\cdot \mid X')$, and $A \overset{D}{=} B$ denotes equality in probability law.
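The tabular special case of the Q-learning update above, which takes a gradient step on the squared TD error for a single transition, can be sketched as follows. The state and action encodings are illustrative:

```python
import numpy as np

def td_update(Q, transition, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on the squared TD error delta_t^2.

    Q: dict mapping (state, action) -> value estimate.
    transition: (x, a, r, x_next, actions), with `actions` the action set.
    Returns the TD error delta_t for inspection."""
    x, a, r, x_next, actions = transition
    # Bootstrapped target: r_t + gamma * max_a' Q(x_{t+1}, a').
    target = r + gamma * max(Q.get((x_next, b), 0.0) for b in actions)
    delta = target - Q.get((x, a), 0.0)
    # Gradient step on delta^2 w.r.t. Q(x, a) with step size alpha.
    Q[(x, a)] = Q.get((x, a), 0.0) + alpha * delta
    return delta
```

Repeatedly applying this update on sampled transitions shrinks the TD error toward zero, which is the fixed-point condition in (1) specialized to the tabular case.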

3. NON-MONOTONICITY IN DISTRIBUTIONAL REINFORCEMENT LEARNING

In distributional RL studies, attention usually focuses on the quantile function $F_Z^{-1}(\tau) = \inf\{z \in \mathbb{R} : \tau \leq F_Z(z)\}$ of the total return $Z$, which is the generalized inverse of the cumulative distribution function $F_Z(z) = \Pr(Z \leq z)$. In practice, with limited observations at each quantile level, it is quite likely that the obtained quantile estimates $F_Z^{-1}(\tau)$ at multiple given locations $\{\tau_1, \tau_2, \ldots, \tau_N\}$ for some state-action pair $(x, a)$ are non-monotonic, as Figure 1(a) illustrates. The key reason behind this phenomenon is that the quantile values at the different quantile fractions are estimated separately, without any global constraint to ensure monotonicity. Ignoring the non-decreasing property leaves the learned curve invalid as a quantile function.
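The crossing phenomenon is easy to reproduce: estimating each quantile independently from small, noisy resamples can yield a non-monotonic curve. A post-hoc sort would repair it, whereas NDQFN enforces monotonicity by construction. A small illustrative sketch, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are observed returns for one state-action pair.
returns = rng.normal(size=20)
taus = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Mimic noisy per-fraction estimation: each quantile is fitted on its
# own small resample, with no constraint tying the estimates together.
noisy = np.array([np.quantile(rng.choice(returns, 8), t) for t in taus])

# The independently fitted estimates may violate monotonicity.
crossed = bool(np.any(np.diff(noisy) < 0))

# A post-hoc repair is to sort the estimates; an architecture like
# NDQFN makes this repair step unnecessary.
repaired = np.sort(noisy)
assert np.all(np.diff(repaired) >= 0)
```

Whether a crossing actually appears depends on the sample and seed; the point is that nothing in the independent estimation procedure rules it out.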




