NON-DECREASING QUANTILE FUNCTION NETWORK WITH EFFICIENT EXPLORATION FOR DISTRIBUTIONAL REINFORCEMENT LEARNING

Abstract

Although distributional reinforcement learning (DRL) has been widely examined in the past few years, two open questions remain. One is how to ensure the validity of the learned quantile function; the other is how to efficiently utilize the distribution information. This paper attempts to provide some new perspectives to encourage future in-depth studies of these two problems. We first propose a non-decreasing quantile function network (NDQFN) to guarantee the monotonicity of the obtained quantile estimates, and then design a general exploration framework for DRL called distributional prediction error (DPE), which utilizes the entire distribution of the quantile function. We not only discuss the theoretical necessity of our method but also demonstrate the performance gain it achieves in practice by comparing with several competitors on Atari 2600 games, especially some hard-exploration games.

1. INTRODUCTION

Distributional reinforcement learning (DRL) algorithms (Jaquette, 1973; Sobel, 1982; White, 1988; Morimura et al., 2010; Bellemare et al., 2017), unlike value based methods (Watkins, 1989; Mnih et al., 2013) which focus on the expectation of the return, characterize the cumulative reward as a random variable and attempt to approximate its whole distribution. Most existing DRL methods fall into two main classes according to how they model the return distribution. One class, including categorical DQN (C51) (Bellemare et al., 2017) and Rainbow (Hessel et al., 2018), assumes that all possible returns are bounded in a known range and learns the probability of each value through interacting with the environment. Another class, for example QR-DQN (Dabney et al., 2018b), tries to obtain quantile estimates at some fixed locations by minimizing the Huber loss (Huber, 1992) of quantile regression (Koenker, 2005). To parameterize the entire distribution more precisely, quantile value based methods such as IQN (Dabney et al., 2018a) and FQF (Yang et al., 2019) have been proposed to learn a continuous map from any quantile fraction τ ∈ (0, 1) to its estimate on the quantile curve. The theoretical validity of QR-DQN (Dabney et al., 2018b), IQN (Dabney et al., 2018a) and FQF (Yang et al., 2019) heavily depends on a prerequisite that the approximated quantile curve is non-decreasing. Unfortunately, since no global constraint is imposed when simultaneously estimating the quantile values at multiple locations, the monotonicity cannot be ensured by any existing network design. At the early training stage, the crossing issue is even more severe given limited training samples. Another problem to be solved is how to design an efficient exploration method for DRL.
Most existing exploration techniques are originally designed for non-distributional RL (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017; Fox et al., 2018; Machado et al., 2018), and very few of them work for DRL. Mavrin et al. (2019) propose a DRL-based exploration method, DLTV, for QR-DQN, which uses the left truncated variance as the exploration bonus. However, this approach cannot be directly applied to quantile value based algorithms: the original DLTV method requires all quantile locations to be fixed, while IQN and FQF resample the quantile locations at each training iteration, which makes the bonus term extremely unstable. To address these two common issues in DRL studies, we propose a novel algorithm called Non-Decreasing Quantile Function Network (NDQFN), together with an efficient exploration strategy, distributional prediction error (DPE), designed for DRL. The NDQFN architecture approximates the quantile distribution of the return by a non-decreasing piecewise linear function. The monotonicity at the N + 1 fixed quantile levels is ensured by an incremental structure, and the quantile value at any τ ∈ (0, 1) is estimated as the weighted sum of its two nearest neighbors among the N + 1 locations. DPE uses the 1-Wasserstein distance between the quantile distributions estimated by the target network and the predictor network as an additional bonus when selecting the optimal action. We describe the implementation details of NDQFN and DPE and examine their performance on Atari 2600 games. We compare NDQFN and DPE with baseline methods such as IQN and DLTV, and show that the combination of the two consistently achieves the best performance, especially in hard-exploration games such as Venture, Montezuma's Revenge and Private Eye. For the rest of this paper, we first go through some background knowledge of distributional RL in Section 2.
Then in Sections 3 to 5, we discuss the crossing issue and introduce NDQFN and DPE together with their implementation details. Section 6 presents the experiment results on the Atari benchmark, investigating the empirical performance of NDQFN, DPE and their combination by comparing with baseline methods.

2. BACKGROUND AND RELATED WORK

Following the standard reinforcement learning setting, the agent-environment interactions are modeled as a Markov Decision Process, or MDP, $(\mathcal{X}, \mathcal{A}, R, P, \gamma)$ (Puterman, 2014). $\mathcal{X}$ and $\mathcal{A}$ denote a finite set of states and a finite set of actions, respectively. $R : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is a discount factor. $P : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0, 1]$ is the transition kernel. On the agent side, a stochastic policy $\pi$ maps state $x$ to a distribution over actions, $a \sim \pi(\cdot|x)$, regardless of the time step $t$. The discounted cumulative reward is denoted by a random variable $Z^\pi(x, a) = \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)$, where $x_0 = x$, $a_0 = a$, $x_t \sim P(\cdot|x_{t-1}, a_{t-1})$ and $a_t \sim \pi(\cdot|x_t)$. The objective is to find an optimal policy $\pi^*$ that maximizes the expectation of $Z^\pi(x, a)$, $\mathbb{E}Z^\pi(x, a)$, which is denoted by the state-action value function $Q^\pi(x, a)$. A common way to obtain $\pi^*$ is to find the unique fixed point $Q^* = Q^{\pi^*}$ of the Bellman optimality operator $\mathcal{T}$ (Bellman, 1966):

$$Q^*(x, a) = \mathcal{T} Q^*(x, a) := \mathbb{E}[R(x, a)] + \gamma\, \mathbb{E}_P \max_{a'} Q^*(x', a'). \quad (1)$$

The state-action value function $Q$ can be approximated by a parameterized function $Q_\theta$ (e.g. a neural network). Q-learning (Watkins, 1989) iteratively updates the network parameters by minimizing the squared temporal difference (TD) error

$$\delta_t^2 = \left[ r_t + \gamma \max_{a' \in \mathcal{A}} Q_\theta(x_{t+1}, a') - Q_\theta(x_t, a_t) \right]^2$$

on a sampled transition $(x_t, a_t, r_t, x_{t+1})$ collected by running an $\epsilon$-greedy policy over $Q_\theta$. Distributional RL algorithms, instead of directly estimating the mean $\mathbb{E}Z^\pi(x, a)$, focus on the distribution of $Z^\pi(x, a)$ to sufficiently capture the intrinsic randomness. The distributional Bellman operator, which has a similar structure to (1), is defined as

$$\mathcal{T} Z^\pi(x, a) \overset{D}{=} R(x, a) + \gamma Z^\pi(X', A'),$$
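As a minimal illustration of the scalar TD update above (this tabular setup and all numbers are our own, not from the paper), the squared TD error on a single transition can be computed as:

```python
import numpy as np

def td_error(q, state, action, reward, next_state, gamma=0.99):
    """Squared temporal-difference error for tabular Q-learning (illustrative)."""
    target = reward + gamma * np.max(q[next_state])
    return (target - q[state, action]) ** 2

# Toy 2-state, 2-action Q-table with hypothetical values.
q = np.array([[0.0, 1.0],
              [0.5, 0.2]])
err = td_error(q, state=0, action=0, reward=1.0, next_state=1, gamma=0.9)
# target = 1.0 + 0.9 * max(0.5, 0.2) = 1.45, so err = 1.45 ** 2 = 2.1025
```

Distributional methods replace the scalar `target` with a full (quantile) distribution of the return, as formalized by the distributional Bellman operator above.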

3. NON-MONOTONICITY IN DISTRIBUTIONAL REINFORCEMENT LEARNING

In distributional RL studies, attention usually centers on the quantile function $F^{-1}_Z(\tau) = \inf\{z \in \mathbb{R} : \tau \le F_Z(z)\}$ of the total return $Z$, which is the inverse of the cumulative distribution function $F_Z(z) = \Pr(Z \le z)$. In practice, with limited observations at each quantile level, it is quite likely that the obtained quantile estimates $F^{-1}_Z(\tau)$ at multiple given locations $\{\tau_1, \tau_2, \ldots, \tau_N\}$ for some state-action pair $(x, a)$ are non-monotonic, as Figure 1(a) illustrates. The key reason behind this phenomenon is that the quantile values at multiple quantile fractions are estimated separately, without any global constraint to ensure monotonicity. Ignoring the non-decreasing property of the learned quantile function leaves a theory-practice gap, which can result in non-optimal action selection in practice. Although the crossing issue has been broadly studied in the statistics community (He, 1997; Chernozhukov et al., 2010; Dette & Volgushev, 2008), ensuring the monotonicity of the approximated quantile function in DRL remains challenging, especially for quantile value based algorithms such as IQN and FQF, which do not use fixed quantile locations during training.
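As a toy illustration of the crossing issue (the numbers below are hypothetical, not from the paper), a learned quantile curve is invalid whenever the estimates at increasing fractions ever decrease:

```python
import numpy as np

def has_crossing(quantile_values):
    """True if quantile estimates at increasing fractions are not non-decreasing."""
    q = np.asarray(quantile_values, dtype=float)
    return bool(np.any(np.diff(q) < 0))

# Hypothetical estimates at tau = 0.1, 0.3, 0.5, 0.7, 0.9:
print(has_crossing([1.0, 1.2, 1.5, 1.4, 1.8]))  # True  (dip at the 4th estimate)
print(has_crossing([1.0, 1.2, 1.5, 1.5, 1.8]))  # False (valid, non-decreasing)
```

A separately-regressed estimate at each fraction gives no mechanism to rule the first case out, which is exactly the gap NDQFN closes.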

4. NON-DECREASING QUANTILE FUNCTION NETWORK

To address the crossing issue, we introduce a novel Non-Decreasing Quantile Function Network (NDQFN) with two main components: (1) an incremental structure which estimates the increase of the quantile value between two pre-defined nearby supporting points $p_{i-1} \in (0, 1)$ and $p_i \in (0, 1)$, i.e., $\Delta_i(x, a; p) = F^{-1}_Z(p_i) - F^{-1}_Z(p_{i-1})$, and subsequently obtains each $F^{-1}_Z(p_i)$ for $i \in \{0, \ldots, N\}$ as the cumulative sum of the $\Delta_i$'s; (2) a piecewise linear function which connects the $N + 1$ supporting points to represent the quantile estimate $F^{-1}_{Z(x,a)}(\tau)$ at any given fraction $\tau \in (0, 1)$. Figure 2 describes the whole architecture of NDQFN.

We first describe the incremental structure. Let $p = \{p_0, \ldots, p_N\}$ be a set of $N + 1$ supporting points satisfying $0 < p_i \le p_{i+1} < 1$ for each $i \in \{0, 1, \ldots, N - 1\}$. Then $Z(x, a)$ can be parameterized by a mixture of $N$ Diracs, $Z_{\theta,p}(x, a) := \sum_{i=0}^{N-1} (p_{i+1} - p_i)\, \delta_{\theta_i(x,a)}$, where each $\theta_i$ denotes the quantile estimate at $\hat{p}_i = \frac{p_i + p_{i+1}}{2}$, and $\theta = \{\theta_0, \ldots, \theta_{N-1}\}$. Let $\psi : \mathcal{X} \to \mathbb{R}^d$ and $\phi : [0, 1] \to \mathbb{R}^d$ represent the embeddings of state $x$ and quantile fraction $\tau$, respectively. The baseline value $\Delta_0(x, a; p) := F^{-1}_{Z(x,a)}(p_0)$ at $p_0$ and the following $N$ non-negative increments $\Delta_i(x, a; p)$ for $i \in \{1, \ldots, N\}$ are represented by

$$\Delta_0(x, a; p) \approx \Delta_{0,\omega}(x, a; p) = f(\psi(x))_a, \qquad \Delta_i(x, a; p) \approx \Delta_{i,\omega}(x, a; p) = g(\psi(x), \phi(p_i), \phi(p_{i-1}))_a, \quad i = 1, \ldots, N, \quad (3)$$

where $f : \mathbb{R}^d \to \mathbb{R}^{|\mathcal{A}|}$ and $g : \mathbb{R}^{2d} \to [0, \infty)^{|\mathcal{A}|}$ are two functions to be learned, and $\omega$ collects all the parameters of the network.

We then use $\Pi_{p,\Delta_\omega}$ to denote a projection operator that projects the quantile function onto a piecewise linear function supported by $p$ and the incremental structure $\Delta_\omega = \{\Delta_{0,\omega}, \ldots, \Delta_{N,\omega}\}$ above. For any quantile level $\tau \in (0, 1)$, the projection-based quantile estimate is

$$F^{-1,p,\Delta_\omega}_{Z(x,a)}(\tau) = \Pi_{p,\Delta_\omega} F^{-1}_{Z(x,a)}(\tau) = \Delta_{0,\omega}(x, a; p) + \sum_{i=0}^{N-1} G_{i,\Delta_\omega}(\tau, x, a; p),$$

where $G_{i,\Delta_\omega}$ can be regarded as a linear combination of the quantile estimates at two nearby supporting points:

$$G_{i,\Delta_\omega}(\tau, x, a; p) = \left[ \sum_{j=1}^{i} \Delta_{j,\omega}(x, a; p) + \frac{\tau - p_i}{p_{i+1} - p_i}\, \Delta_{i+1,\omega}(x, a; p) \right] \mathbb{I}(p_i \le \tau < p_{i+1}).$$

By limiting the output range of $g_a$ in (3) to $[0, \infty)$, the obtained $N + 1$ quantile estimates are non-decreasing. For notational simplicity, we write $P_{\tau,\omega}(x, a) = F^{-1,p,\Delta_\omega}_{Z(x,a)}(\tau)$, given the support $p$. Under the NDQFN framework, the expected future return starting from $(x, a)$, also known as the Q-function, can be empirically approximated by

$$Q_\omega(x, a) = \int_{p_0}^{p_N} F^{-1,p,\Delta_\omega}_{Z(x,a)}(\tau)\, d\tau = \sum_{i=0}^{N-1} (p_{i+1} - p_i)\, F^{-1,p,\Delta_\omega}_{Z(x,a)}\!\left(\frac{p_i + p_{i+1}}{2}\right) = \sum_{i=0}^{N-1} \frac{p_{i+1} - p_i}{2} \left[ F^{-1,p,\Delta_\omega}_{Z(x,a)}(p_{i+1}) + F^{-1,p,\Delta_\omega}_{Z(x,a)}(p_i) \right].$$

Following the idea of IQN, two random sets of quantile fractions $\tau = \{\tau_1, \ldots, \tau_{N_1}\}$ and $\tau' = \{\tau'_1, \ldots, \tau'_{N_2}\}$ are independently drawn from a uniform distribution $U(0, 1)$ at each training iteration. In this case, for each $i \in \{1, \ldots, N_1\}$ and each $j \in \{1, \ldots, N_2\}$, the corresponding temporal difference (TD) error (Dabney et al., 2018a) with n-step updates on $(x_t, a_t, r_t, \ldots, r_{t+n-1}, x_{t+n})$ is computed as

$$\delta_{i,j} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n P_{\tau'_j,\omega'}\!\left(x_{t+n}, \arg\max_{a' \in \mathcal{A}} Q_\omega(x_{t+n}, a')\right) - P_{\tau_i,\omega}(x_t, a_t),$$

where $\omega$ and $\omega'$ denote the online network and the target network, respectively. Thus, we can train the whole network by minimizing the Huber quantile regression loss (Huber, 1992):

$$L(x_t, a_t, r_t, \ldots, r_{t+n-1}, x_{t+n}) = \frac{1}{N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \rho^\kappa_{\tau_i}(\delta_{i,j}), \quad (6)$$

where

$$\rho^\kappa_\tau(\delta_{i,j}) = \left|\tau - \mathbb{I}(\delta_{i,j} < 0)\right| \frac{L_\kappa(\delta_{i,j})}{\kappa}, \qquad L_\kappa(\delta_{i,j}) = \begin{cases} \frac{1}{2}\delta_{i,j}^2, & \text{if } |\delta_{i,j}| \le \kappa, \\ \kappa\left(|\delta_{i,j}| - \frac{1}{2}\kappa\right), & \text{otherwise,} \end{cases} \quad (7)$$

$\kappa$ is a pre-defined positive constant, and $|\cdot|$ denotes the absolute value of a scalar.

Remark: As shown in Figure 1(c), the piecewise linear structure ensures monotonicity within the union of any two quantile sets $\tau$ and $\tau'$ from two different training iterations, regardless of whether they are included in $p$ or not. However, as Figure 1(b) demonstrates, directly applying a similar incremental structure to IQN without the piecewise linear approximation may result in non-monotonicity over $\tau \cup \tau'$, even though monotonicity within each set is not violated.

To investigate the convergence of the proposed algorithm, we introduce the following theorem, which can be seen as an extension of Proposition 2 in Dabney et al. (2018b).

Theorem 1. Let $\Pi_{p,\Delta_\omega}$ be the quantile projection defined as above with non-decreasing quantile function $F^{-1,p,\Delta_\omega}_Z$. For any two value distributions $Z_1, Z_2 \in \mathcal{Z}$ for an MDP with countable state and action spaces, and sufficiently large $N$,

$$\bar{d}_\infty\!\left(\Pi_{p,\Delta_\omega} \mathcal{T} F^{-1,p,\Delta_\omega}_{Z_1},\, \Pi_{p,\Delta_\omega} \mathcal{T} F^{-1,p,\Delta_\omega}_{Z_2}\right) \le \gamma\, \bar{d}_\infty\!\left(F^{-1,p,\Delta_\omega}_{Z_1}, F^{-1,p,\Delta_\omega}_{Z_2}\right),$$

where $\mathcal{T}$ denotes the distributional Bellman operator, $\bar{d}_k(F^{-1}_{Z_1}, F^{-1}_{Z_2}) := \sup_{x,a} W_k(F^{-1}_{Z_1(x,a)}, F^{-1}_{Z_2(x,a)})$, and $W_k(\cdot, \cdot)$ denotes the $k$-Wasserstein metric.

By Theorem 1, we conclude that $\Pi_{p,\Delta_\omega} \mathcal{T}$ has a unique fixed point and that repeated application of $\Pi_{p,\Delta_\omega} \mathcal{T}$ converges to it. Since $\bar{d}_k \le \bar{d}_\infty$, the convergence occurs for all $k \in [1, \infty)$. This ensures that we can obtain a consistent estimator of $F^{-1}_Z$, denoted $F^{-1,p,\Delta_\omega}_Z$, by minimizing the Huber quantile regression loss defined in (6).

Now we discuss the implementation details of NDQFN. For the support $p$ used in this work, we let $p_i = i/N$ for $i \in \{1, \ldots, N-1\}$, $p_0 = 0.001$ and $p_N = 0.999$. NDQFN models the state embedding $\psi(x)$ and the $\tau$-embedding $\phi(\tau)$ in the same way as IQN. The baseline function $f$ in (3) consists of two fully-connected layers with a sigmoid activation in between, takes $\psi(x)$ as input, and returns an unconstrained value for $\Delta_0(x, a; p)$. The incremental function $g$ shares the same structure as $f$ but uses the ReLU activation instead, to ensure the non-negativity of all $N$ increments $\Delta_i(x, a; p)$. Let $\odot$ denote the element-wise product, and let $g$ take $\psi(x) \odot \phi(p_i)$ and $\phi(p_i) - \phi(p_{i-1})$ as input, so that

$$\Delta_{i,\omega}(x, a; p) = g\big(\psi(x) \odot \phi(p_i),\, \phi(p_i) - \phi(p_{i-1})\big)_a, \quad i = 1, \ldots, N,$$

which to some extent captures the interactions among $\psi(x)$, $\phi(p_i)$ and $\phi(p_{i-1})$. Although $\psi(x) \odot \phi(p_i) \odot \phi(p_{i-1})$ may be another potential input form, empirical results show that the combination of $\psi(x) \odot \phi(p_i)$ and $\phi(p_i) - \phi(p_{i-1})$ is preferable in practice, considering its outperformance on Atari games. More details are provided in Section C of the supplement.
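The two NDQFN ingredients can be sketched in a few lines of NumPy. In the sketch below (our own illustration, not the paper's implementation: a single state-action pair whose baseline and increments are assumed to have already been produced by f and g, with hypothetical values), the cumulative sum of the increments yields non-decreasing quantiles at the supports by construction, and any fraction τ is handled by linear interpolation; the quantile Huber loss ρ is included for completeness:

```python
import numpy as np

def ndqfn_quantile(tau, p, delta):
    """Piecewise-linear quantile estimate at fraction tau.

    p     : supporting fractions p_0 < ... < p_N
    delta : delta[0] is the (possibly negative) baseline value at p_0;
            delta[1:] are the non-negative increments between supports.
    """
    # Cumulative sums give the quantile values at the supports; since
    # delta[1:] >= 0, these are non-decreasing by construction.
    support_vals = np.cumsum(delta)
    # Locate the interval [p_i, p_{i+1}) containing tau and interpolate.
    i = int(np.clip(np.searchsorted(p, tau, side="right") - 1, 0, len(p) - 2))
    w = (tau - p[i]) / (p[i + 1] - p[i])
    return support_vals[i] + w * delta[i + 1]

def quantile_huber(delta_ij, tau, kappa=1.0):
    """rho^kappa_tau(delta_ij): asymmetrically weighted Huber loss."""
    huber = np.where(np.abs(delta_ij) <= kappa,
                     0.5 * delta_ij ** 2,
                     kappa * (np.abs(delta_ij) - 0.5 * kappa))
    return np.abs(tau - (delta_ij < 0)) * huber / kappa

# Hypothetical supports and network outputs (N = 4).
p = np.array([0.001, 0.25, 0.5, 0.75, 0.999])
delta = np.array([-1.0, 0.5, 0.5, 1.0, 2.0])
print(ndqfn_quantile(0.5, p, delta))   # hits a support exactly: -1 + 0.5 + 0.5 = 0.0
```

Note the asymmetry of the loss: for τ = 0.9, an over-estimate (negative error) is penalized far less than an equally large under-estimate, which is what drives the regression toward the 0.9-quantile.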

5. EXPLORATION USING DISTRIBUTIONAL PREDICTION ERROR

To further improve the learning efficiency of NDQFN, we introduce a novel exploration approach called distributional prediction error (DPE), motivated by the Random Network Distillation (RND) method (Burda et al., 2018), which can be extended to most existing DRL frameworks. DPE involves three networks, the online network, the target network and the predictor network, which share the same architecture but have different parameters. The online network $\omega$ and the target network $\omega'$ use the same initialization, while the predictor network $\omega^*$ is initialized separately. More specifically, the predictor network is trained on the sampled data using the quantile Huber loss

$$L(x_t, a_t) = \frac{1}{N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \rho^\kappa_{\tau_i}\big(\delta^*_{i,j}\big), \qquad \text{where } \delta^*_{i,j} = P_{\tau'_j,\omega'}(x_t, a_t) - P_{\tau_i,\omega^*}(x_t, a_t)$$

and $\rho^\kappa_\tau(\cdot)$ is defined in (7). On the other hand, following Hasselt (2010) and Pathak et al. (2019), the target network of DPE is periodically synchronized with the online network. DPE employs the 1-Wasserstein metric $W_1(\cdot, \cdot)$ to measure the distance between the two quantile distributions associated with the predictor network and the target network, whose empirical approximation based on NDQFN is

$$i(x_t, a_t) = \int_{p_0}^{p_N} \left| P_{\tau,\omega'}(x_t, a_t) - P_{\tau,\omega^*}(x_t, a_t) \right| d\tau.$$

Since the minimum of $L(x_t, a_t)$ is a contraction of the 1-Wasserstein distance (Bellemare et al., 2017), the prediction error tends to be higher for an unobserved state-action pair than for a frequently visited one. Therefore, $i(x_t, a_t)$ can be treated as an exploration bonus when selecting the optimal action, which encourages the agent to explore unknown states. With $i(x_t, a_t)$ as the exploration bonus, the optimal action is determined by $a_t = \arg\max_a \left[Q_\omega(x_t, a) + c_t\, i(x_t, a)\right]$, where $c_t$ is the bonus rate.
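To make the bonus concrete, the sketch below (our own, with hypothetical names and numbers) approximates the 1-Wasserstein bonus $i(x_t, a_t)$ by the trapezoidal rule over the supporting fractions, then picks the bonus-augmented greedy action:

```python
import numpy as np

def dpe_bonus(target_quantiles, predictor_quantiles, p):
    """Empirical 1-Wasserstein distance between target- and predictor-network
    quantile estimates evaluated at the supporting fractions p."""
    gaps = np.abs(np.asarray(target_quantiles) - np.asarray(predictor_quantiles))
    # Trapezoidal approximation of the integral over [p_0, p_N].
    return float(np.sum(0.5 * (gaps[1:] + gaps[:-1]) * np.diff(p)))

def select_action(q_values, bonuses, c=1.0):
    """a_t = argmax_a [Q(x_t, a) + c * i(x_t, a)]."""
    return int(np.argmax(np.asarray(q_values) + c * np.asarray(bonuses)))

# Hypothetical two-action example: action 1 has lower Q but a larger bonus,
# so the bonus-augmented policy explores it.
print(select_action([1.0, 0.9], [0.0, 0.5], c=1.0))  # 1
```

A rarely visited state-action pair yields larger gaps between target and predictor quantiles, hence a larger bonus and a stronger pull toward exploration.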
As might be expected, a more reasonable quantile estimate should greatly enhance the exploration efficiency under our exploration setting. To demonstrate this point more clearly, we compare the performance of DPE based on NDQFN and on IQN in Section 6.2. We also compare the exploration efficiency of the distributional prediction error with a value-based prediction error in Section D of the supplement, where the value-based prediction error is defined as $i'(x_t, a_t) = |Q_{\omega^*}(x_t, a_t) - Q_{\omega}(x_t, a_t)|$. As might be expected, DPE enhances the exploration efficiency by utilizing more distributional information.

6. EXPERIMENTS

In this section, we evaluate the empirical performance of the proposed NDQFN with DPE on 55 Atari games using the Arcade Learning Environment (ALE) (Bellemare et al., 2013). The baseline algorithms we compare with include IQN (Dabney et al., 2018a), QR-DQN (Dabney et al., 2018b), C51 (Bellemare et al., 2017), prioritized experience replay (Schaul et al., 2015) and Rainbow (Hessel et al., 2018). The whole architecture of our method is implemented on top of the IQN baseline in the Dopamine framework (Castro et al., 2018); note that the method could also be built upon other quantile value based baselines such as FQF. Our hyper-parameter setting is aligned with IQN for a fair comparison. Furthermore, we incorporate n-step updates (Sutton, 1988), double Q-learning (Hasselt, 2010), quantile regression (Koenker, 2005) and the distributional Bellman update (Dabney et al., 2018b) into the training of all compared methods. The size of $p$ and the numbers of sampled quantile levels $\tau$ and $\tau'$ are all 32, i.e. $N = N_1 = N_2 = 32$. We use an $\epsilon$-greedy policy with $\epsilon = 0.01$ for training. At the evaluation stage, we test the agent every 0.125 million frames with $\epsilon = 0.001$. For the DPE exploration, we fix $c_t = 1$ without decay (which is the optimal setting based on our experiments). As a comparison, we extend DLTV (Mavrin et al., 2019), an exploration approach designed for DRL algorithms based on fixed quantile locations such as QR-DQN, to quantile value based methods by modifying the original left truncated variance used in their paper to $\sigma^2_+ = \int_{1/2}^{1} \left[F^{-1}_Z(\tau) - F^{-1}_Z\!\left(\tfrac{1}{2}\right)\right]^2 d\tau$. All training curves in this paper are smoothed with a moving average of 10 to improve readability. With the same hyper-parameter setting, NDQFN with DPE is about 25% slower than IQN at the training stage due to the added predictor network.
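For reference, the modified left truncated variance can be approximated numerically. The sketch below (our own, using a toy uniform(0, 1) return distribution for which the integral has the closed form 1/24) discretizes the integral on [1/2, 1] with the trapezoidal rule:

```python
import numpy as np

def left_truncated_variance(quantile_fn, n=10001):
    """Approximate sigma^2_+ = int_{1/2}^{1} [F^{-1}(tau) - F^{-1}(1/2)]^2 dtau."""
    taus = np.linspace(0.5, 1.0, n)
    gaps = (quantile_fn(taus) - quantile_fn(np.array([0.5]))[0]) ** 2
    # Trapezoidal rule over [1/2, 1].
    return float(np.sum(0.5 * (gaps[1:] + gaps[:-1]) * np.diff(taus)))

# Toy check: for a uniform(0, 1) return, F^{-1}(tau) = tau and
# sigma^2_+ = int_{1/2}^{1} (tau - 1/2)^2 dtau = 1/24.
sigma2 = left_truncated_variance(lambda t: t)
```

In DLTV, this quantity serves as the exploration bonus; its instability for quantile value based methods stems from `quantile_fn` being evaluated at freshly resampled fractions each iteration.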

6.1. NDQFN VS IQN

In this part, we compare NDQFN with its baseline IQN (Dabney et al., 2018a) to verify the improvement achieved by the non-decreasing structure. The two methods are compared fairly, without any additional exploration strategy. Figure 3 visualizes the training curves on eleven randomly selected easy-exploration Atari games. NDQFN significantly outperforms IQN in most cases, including Berzerk, KungFu Master, Ms. Pac-Man, Qbert, River Raid, Up and Down, Video Pinball and Zaxxon. Although NDQFN achieves final scores similar to IQN's in Battle Zone and Tennis, it learns much faster. This suggests that the crossing issue, which usually occurs in the early training stage with limited training samples, can be sufficiently addressed by the incremental design in NDQFN, which ensures the monotonicity of the quantile estimates. More related results can be found in Section E of the supplement.

6.2. EFFECTS OF NDQFN ON DPE

In this section, we show how the NDQFN design helps improve the exploration efficiency of DPE compared with the IQN baseline. We pick three hard-exploration games, including Montezuma's Revenge, and three easy-exploration games. As Figure 4 illustrates, DPE is an effective exploration method which substantially improves the training performance of both IQN and NDQFN in the three hard-exploration games. In the three easy-exploration games, DPE still performs slightly better than the baseline methods, which to some extent demonstrates the stability and robustness of DPE. On the other hand, we find that NDQFN + DPE significantly outperforms IQN + DPE, especially in the hard-exploration games. This agrees with our conclusion in Section 5 that NDQFN obtains a more reasonable quantile estimate by adding the non-decreasing constraint, which helps to increase the exploration efficiency.

6.3. DPE VS DLTV

We conduct additional experiments to show the advantage of DPE over DLTV when applying both exploration approaches to quantile value based DRL algorithms. Specifically, we evaluate the performance of NDQFN+DPE and NDQFN+DLTV on the six games examined in the previous subsection. Figure 5 shows that DPE achieves much better performance than DLTV, especially in the three hard-exploration games; DLTV does not achieve a significant performance gain over the baseline NDQFN without exploration. This indicates that DLTV, which computes the exploration bonus based on some discrete quantile locations, is extremely unstable for quantile value based methods such as IQN and NDQFN, since the quantile locations used to train the model change at each iteration. In contrast, DPE focuses on the entire distribution of the quantile function, relies less on the selection of the quantile fractions, and thus performs better in this case.

6.4. FULL ATARI RESULTS

In this part, we evaluate the performance of the whole network, which incorporates NDQFN with DPE, on all 55 Atari games. Following the IQN setting (Dabney et al., 2018a), all evaluations start with 30 non-optimal actions to align with previous distributional RL works. 'IQN' in the table reproduces the results presented in the original IQN paper, while 'IQN*' corresponds to the improved version of IQN using n-step updates and the other tricks mentioned above.


Table 1 presents the mean and median human-normalized scores of the seven compared methods across 55 Atari games, showing that NDQFN with DPE significantly outperforms the IQN baseline and other state-of-the-art algorithms such as Rainbow. In particular, Figure 6 plots the relative improvement in testing score of NDQFN+DPE over IQN, computed as (NDQFN − random)/(IQN − random) − 1, for all 55 Atari games. We observe that our method consistently outperforms the IQN baseline in most situations, especially in some hard games with extremely sparse rewards, such as Private Eye and Montezuma's Revenge. More comparison results, including the training curves for all 55 games and the raw scores, are provided in Sections F and G of the supplement.

7. CONCLUSION AND DISCUSSION

In this paper, we attempt to address two important issues in distributional reinforcement learning. We first propose a Non-Decreasing Quantile Function Network (NDQFN) to ensure the validity of the learned quantile function by imposing a monotonicity constraint on the quantile estimates at different locations via a piecewise linear incremental structure. We also introduce DPE, a general exploration method for distributional RL frameworks, which utilizes the information of the entire distribution and performs much more efficiently than existing methods. Experiment results on more than 50 Atari games illustrate that NDQFN models the quantile distribution more precisely than the baseline methods. Moreover, DPE performs better than DLTV in both hard-exploration and easy-exploration Atari games, and ablation studies show that the combination of NDQFN with DPE achieves the best overall performance. Some questions remain open. First, since many non-negative functions are available, is ReLU a good choice for the function g? Second, will a better approximated distribution affect the agent's policy? If so, how does it affect the training process, given that NDQFN can provide a potentially better distribution approximation? More generally, how can we analytically compare the optimal policies that NDQFN and IQN converge to? Third, how much does the periodically synchronized target network affect the efficiency of the DPE exploration? For future work, we first want to apply NDQFN to other quantile value based methods such as FQF, to see whether the incremental structure can improve the performance of other existing DRL algorithms. Second, since there exist other kinds of DRL methods that naturally ensure the monotonicity of the learned quantile function (Yue et al., 2020) but require extremely large computation costs, it may be interesting to integrate our method with them to obtain improved empirical performance and training efficiency.



where $X' \sim P(\cdot|x, a)$ and $A' \sim \pi(\cdot|X')$. $A \overset{D}{=} B$ denotes equality in probability laws.

Figure 1: Quantile estimates obtained by (a) IQN, (b) the mere incremental structure and (c) NDQFN after 50M training frames on Breakout. τ and τ' denote two sets of quantile fractions sampled at two different training iterations.

Figure 2: The network architecture of NDQFN.

Figure 3: Training curve on Atari games for NDQFN and IQN.

Figure 4: Comparison between NDQFN+DPE and IQN+DPE.

Figure 5: Comparison between DPE and DLTV.

Figure 6: Cumulative rewards performance comparison between our method and IQN with 200M training frames. The bars represent relative gain/loss of NDQFN+DPE over IQN.

Table 1: Mean and median of scores across 55 Atari 2600 games, measured as percentages of human baseline. Scores of previous work are referenced from Castro et al. (2018) and Yang et al. (2019).

