NON-DECREASING QUANTILE FUNCTION NETWORK WITH EFFICIENT EXPLORATION FOR DISTRIBUTIONAL REINFORCEMENT LEARNING

Abstract

Although distributional reinforcement learning (DRL) has been widely examined in the past few years, two open questions remain. One is how to ensure the validity of the learned quantile function; the other is how to efficiently utilize the distribution information. This paper provides new perspectives to encourage future in-depth studies of both questions. We first propose a non-decreasing quantile function network (NDQFN) that guarantees the monotonicity of the obtained quantile estimates, and then design a general exploration framework for DRL called distributional prediction error (DPE), which utilizes the entire distribution of the quantile function. We not only discuss the theoretical necessity of our method but also demonstrate the performance gain it achieves in practice over several competitors on Atari 2600 games, especially on hard-exploration games.

1. INTRODUCTION

Distributional reinforcement learning (DRL) algorithms (Jaquette, 1973; Sobel, 1982; White, 1988; Morimura et al., 2010; Bellemare et al., 2017), unlike value-based methods (Watkins, 1989; Mnih et al., 2013) that focus on the expectation of the return, characterize the cumulative reward as a random variable and attempt to approximate its whole distribution. Most existing DRL methods fall into two main classes according to how they model the return distribution. One class, including the categorical algorithm C51 (Bellemare et al., 2017) and Rainbow (Hessel et al., 2018), assumes that all possible returns are bounded in a known range and learns the probability of each value through interaction with the environment. The other class, for example QR-DQN (Dabney et al., 2018b), obtains quantile estimates at fixed locations by minimizing the Huber loss (Huber, 1992) of quantile regression (Koenker, 2005). To parameterize the entire distribution more precisely, quantile-value-based methods such as IQN (Dabney et al., 2018a) and FQF (Yang et al., 2019) learn a continuous map from any quantile fraction τ ∈ (0, 1) to its estimate on the quantile curve. The theoretical validity of QR-DQN (Dabney et al., 2018b), IQN (Dabney et al., 2018a), and FQF (Yang et al., 2019) depends heavily on the prerequisite that the approximated quantile curve is non-decreasing. Unfortunately, since no global constraint is imposed when simultaneously estimating quantile values at multiple locations, monotonicity cannot be ensured by any existing network design. At the early training stage, this crossing issue is even more severe given limited training samples. Another problem to be solved is how to design an efficient exploration method for DRL.
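To make the crossing issue concrete: one standard way to guarantee a non-decreasing quantile curve by construction is to let the network emit a free base value plus non-negative increments and take their cumulative sum. The sketch below illustrates this idea with numpy; the function name and the softplus choice are ours for illustration and do not reproduce the exact NDQFN parameterization.

```python
import numpy as np

def monotone_quantiles(raw_outputs):
    """Map unconstrained network outputs to non-decreasing quantile estimates.

    The first output is kept as a free base value; every later output is
    passed through softplus (strictly positive) and cumulatively summed,
    so q_1 <= q_2 <= ... holds by construction and quantile crossing
    cannot occur. Illustrative sketch, not the exact NDQFN design.
    """
    raw = np.asarray(raw_outputs, dtype=float)
    base = raw[0]
    increments = np.log1p(np.exp(raw[1:]))  # softplus, always > 0
    return np.concatenate([[base], base + np.cumsum(increments)])

# Unconstrained outputs may imply crossing quantiles;
# the mapped estimates are always sorted.
raw = np.array([0.5, -1.2, 2.0, -0.3])
q = monotone_quantiles(raw)
assert np.all(np.diff(q) >= 0)
```

Without such a global constraint, independently regressed quantile estimates at neighboring fractions can cross, which is exactly the failure mode described above.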
Most existing exploration techniques were originally designed for non-distributional RL (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017; Fox et al., 2018; Machado et al., 2018), and very few of them work for DRL. Mavrin et al. (2019) propose a DRL-based exploration method, DLTV, for QR-DQN that uses the left truncated variance as an exploration bonus. However, this approach cannot be directly applied to quantile-value-based algorithms: the original DLTV method requires all quantile locations to be fixed, whereas IQN and FQF resample the quantile locations at each training iteration, which makes the bonus term extremely unstable. To address these two common issues in DRL studies, we propose a novel algorithm called Non-Decreasing Quantile Function Network (NDQFN), together with an efficient exploration strategy called distributional prediction error (DPE).
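For reference, the DLTV bonus mentioned above can be sketched as follows for N fixed quantile estimates. This is one common reading of the left truncated variance of Mavrin et al. (2019); the exact normalization and indexing are assumptions of this sketch, and the function name is ours.

```python
import numpy as np

def left_truncated_variance(quantiles):
    """Left truncated variance, sketched as the DLTV-style exploration bonus.

    Only deviations of the upper half of the quantile estimates from the
    median are counted, so the bonus captures upside (optimistic)
    uncertainty rather than the symmetric spread of the distribution.
    Assumes the quantile locations are fixed, as DLTV requires.
    """
    theta = np.sort(np.asarray(quantiles, dtype=float))  # guard against crossing
    n = len(theta)
    median = theta[n // 2]
    upper = theta[n // 2:]
    return np.sum((upper - median) ** 2) / (2 * n)

# A heavier upper tail (more upside uncertainty) yields a larger bonus.
narrow = left_truncated_variance([0.0, 0.1, 0.2, 0.3])
wide = left_truncated_variance([0.0, 0.1, 0.2, 1.3])
assert wide > narrow
```

Because the bonus is computed from quantile estimates at fixed locations, resampling the locations every iteration (as IQN and FQF do) changes which estimates enter the sum, which is why the term becomes unstable for those methods.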

