ON THE ESTIMATION BIAS IN DOUBLE Q-LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Double Q-learning is a classical method for reducing overestimation bias, which is caused by taking the maximum of estimated values in the Bellman operator. Its variants in the deep Q-learning paradigm have shown great promise in producing reliable value predictions and improving learning performance. However, as shown by prior work, double Q-learning is not fully unbiased and suffers from underestimation bias instead. In this paper, we show that such underestimation bias may lead to multiple non-optimal fixed points under an approximated Bellman operation. To address the concern of converging to non-optimal stationary solutions, we propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning. This approach leverages real returns to bound the target value. We extensively evaluate the proposed method on the Atari benchmark tasks and demonstrate its significant improvement over baseline algorithms.

1. INTRODUCTION

Value-based reinforcement learning with neural networks as function approximators has become a widely used paradigm and shown great promise in solving complicated decision-making problems in various real-world applications, including robotics control (Lillicrap et al., 2016), molecular structure design (Zhou et al., 2019), and recommendation systems (Chen et al., 2018). Towards understanding the foundation of these successes, investigating the algorithmic properties of deep-learning-based value function approximation has seen growing attention in recent years (Van Hasselt et al., 2018; Fu et al., 2019; Achiam et al., 2019; Dong et al., 2020). One phenomenon of interest is that Q-learning (Watkins, 1989) is known to suffer from overestimation, since it takes a maximum operator over estimated action-values. Compared with underestimated values, overestimation errors are more likely to be propagated through greedy action selection, which leads to an overestimation bias in value prediction (Thrun & Schwartz, 1993). This overoptimistic behavior in decision making has also been investigated in the literature of management science (Smith & Winkler, 2006) and economics (Thaler, 1988).

From a statistical perspective, value estimation error may come from many sources, such as the stochasticity of the environment and the limited expressivity of the function class. However, for deep Q-learning algorithms, the overestimation phenomenon remains dramatic (Hasselt et al., 2016) even though most benchmark environments are nearly deterministic (Brockman et al., 2016) and millions of samples are collected. One cause of this issue is the difficulty of optimization: although a deep neural network may have sufficient expressive power to represent an accurate value function, the underlying optimization problem is hard to solve.
As a result of computational considerations, stochastic gradient descent is essentially the default choice for deep reinforcement learning algorithms. The high variance of such stochastic methods in gradient estimation leads to an unavoidable approximation error in value prediction. This kind of approximation error cannot be eliminated by simply increasing the sample size or the network capacity, and it is a major source of overestimation bias.

Double Q-learning, a specific variant of the double estimator (Stone, 1974) in the Q-learning paradigm, is a classical method to reduce the risk of overestimation. It uses a second value function to construct an independent action-value evaluation as a form of cross-validation. Under proper assumptions, double Q-learning was proved to underestimate rather than overestimate the maximum expected values (Van Hasselt, 2010). In practice, however, obtaining two independent value estimators is usually intractable in large-scale tasks, so double Q-learning can still suffer from overestimation in some situations. To address these empirical concerns, Fujimoto et al. (2018) proposed a variant named clipped double Q-learning, which takes the minimum over two value estimates. This approach implicitly penalizes regions with high uncertainty (Fujimoto et al., 2019) and thus significantly suppresses the incentive for overestimation.

In this paper, we first review an analytical model adopted by prior work (Thrun & Schwartz, 1993; Lan et al., 2020) and show that, due to the existence of underestimation bias, both double Q-learning and clipped double Q-learning have multiple approximated fixed points in this model. This result raises the concern that double Q-learning may easily get stuck in local stationary regions and become inefficient in searching for the optimal policy.
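The gap between these estimators can be illustrated with a small Monte Carlo sketch (our own illustration, not an experiment from this paper): the max over noisy estimates overestimates the true maximum value, the double estimator selects an action with one estimate and evaluates it with the other, and the clipped variant takes the minimum of the two evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.linspace(-0.5, 0.0, 5)   # true action values; the best is 0.0
n_trials, noise_std = 100_000, 1.0

single_max, double_est, clipped = [], [], []
for _ in range(n_trials):
    # two independent noisy estimates of the same action values
    qa = true_q + rng.normal(0.0, noise_std, size=5)
    qb = true_q + rng.normal(0.0, noise_std, size=5)
    a_star = qa.argmax()
    single_max.append(qa.max())                  # standard Q-learning target
    double_est.append(qb[a_star])                # double Q-learning target
    clipped.append(min(qa[a_star], qb[a_star]))  # clipped double Q target

print(f"max estimator (overestimates)   : {np.mean(single_max):+.3f}")
print(f"double estimator                : {np.mean(double_est):+.3f}")
print(f"clipped double (lowest of three): {np.mean(clipped):+.3f}")
```

With near-tied true values and independent noise, the max estimator lands well above the true maximum of 0, while the double estimator falls below it (the selected action is sometimes suboptimal), and the clipped variant is lower still.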
To strengthen double Q-learning, we propose a simple heuristic that utilizes real return signals as a lower-bound estimate to rule out the potential non-optimal fixed points. Benefiting from its simplicity, this method is easy to combine with other existing techniques such as clipped double Q-learning. In experiments on Atari benchmark tasks, we demonstrate that this simple approach is effective in improving both sample efficiency and convergence performance.
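One plausible form of this heuristic is to clip the bootstrapped target from below by the observed Monte Carlo return of the sampled trajectory. The sketch below is our own reading of the idea, not the paper's implementation; the function name, argument layout, and the exact clipping rule are assumptions.

```python
import numpy as np

def lower_bounded_target(r, q1_next, q2_next, mc_return, gamma=0.99):
    """Hypothetical return-lower-bounded clipped double Q target.

    `mc_return` is the real discounted return observed from (s, a) onward;
    since any realized return is a lower bound on the optimal value, an
    underestimated bootstrapped target can be replaced by it.
    """
    q1_next, q2_next = np.asarray(q1_next), np.asarray(q2_next)
    a_star = np.argmax(q1_next)          # greedy action from the first network
    # clipped double Q-learning target: min over the two estimates at a_star
    bootstrapped = r + gamma * min(q1_next[a_star], q2_next[a_star])
    # the real return serves as a lower bound on the target value
    return max(bootstrapped, mc_return)
```

For example, with next-state values `[1.0, 2.0]` and `[0.5, 1.5]`, the clipped bootstrap target is 0.99 × 1.5 = 1.485; a recorded return of 2.0 would lift the target to 2.0, while a return of 1.0 would leave it unchanged.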

2. BACKGROUND

A Markov Decision Process (MDP; Bellman, 1957) is a classical framework to formalize an agent-environment interaction system, defined as a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$. We use $\mathcal{S}$ and $\mathcal{A}$ to denote the state and action space, respectively. $P(s'|s,a)$ and $R(s,a)$ denote the transition and reward functions, which are initially unknown to the agent. $\gamma$ is the discount factor. The goal of reinforcement learning is to construct a policy $\pi : \mathcal{S} \to \mathcal{A}$ maximizing the discounted cumulative reward,
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \,\middle|\, s_0 = s,\ s_{t+1} \sim P(\cdot \mid s_t, \pi(s_t))\right].$$
Another quantity of interest in policy learning is defined through the Bellman equation, $Q^\pi(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot|s,a)}[V^\pi(s')]$. The optimal value function $Q^*$ corresponds to the unique solution of the Bellman optimality equation,
$$\forall (s,a) \in \mathcal{S} \times \mathcal{A}, \quad Q^*(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\max_{a' \in \mathcal{A}} Q^*(s', a')\right]. \tag{1}$$
Q-learning algorithms are based on the Bellman operator $\mathcal{T}$, stated as follows:
$$(\mathcal{T}Q)(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\max_{a' \in \mathcal{A}} Q(s', a')\right].$$
By iterating this operator, value iteration provably converges to the optimal value function $Q^*$. To extend Q-learning methods to real-world applications, function approximation is indispensable for dealing with high-dimensional state spaces. Deep Q-learning (Mnih et al., 2015) considers a sample-based objective function and constructs an iterative optimization framework:
$$\theta_{t+1} \leftarrow \arg\min_{\theta \in \Theta}\ \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a' \in \mathcal{A}} Q_{\theta_t}(s', a') - Q_\theta(s,a)\right)^2\right], \tag{2}$$
in which $\Theta$ denotes the parameter space of the value network, and $\theta_0 \in \Theta$ is initialized by some predetermined method. $\mathcal{D}$ is the data distribution, which changes during exploration. With infinite samples and a sufficiently rich function class, the update rule stated in Eq. (2) is asymptotically equivalent to applying the Bellman operator $\mathcal{T}$, but the underlying optimization is usually inefficient in practice. In deep Q-learning, Eq. (2) is optimized by mini-batch gradient descent and thus its value estimation suffers from unavoidable approximation errors.
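As a concrete illustration of iterating the Bellman operator T, the following sketch runs tabular value iteration on a toy deterministic MDP (our own example, not one from the paper):

```python
import numpy as np

# A tiny deterministic 2-state, 2-action MDP (illustrative only).
gamma = 0.9
P = np.array([[0, 1], [1, 0]])          # P[s, a] -> next state
R = np.array([[1.0, 0.0], [0.0, 2.0]])  # R[s, a]  -> immediate reward

Q = np.zeros((2, 2))
for _ in range(500):
    # (T Q)(s, a) = R(s, a) + gamma * max_a' Q(s', a'); since the
    # dynamics are deterministic, the expectation over s' drops out
    Q = R + gamma * Q[P].max(axis=-1)

print(Q.round(3))
```

Because T is a γ-contraction, the iterates converge geometrically to Q*; here the optimal policy loops in state 0 (Q*(0, 0) = 1/(1 − γ) = 10), and Q*(1, 1) = 2 + γ · 10 = 11.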

3. EFFECTS OF UNDERESTIMATION BIAS IN DOUBLE Q-LEARNING

In this section, we first review a common analytical model used in previous work studying estimation bias (Thrun & Schwartz, 1993; Lan et al., 2020), in which double Q-learning is known to have underestimation bias. Based on this analytical model, we show that the underestimation bias can give double Q-learning multiple fixed-point solutions under an approximated Bellman operation. This result suggests that double Q-learning may have extra non-optimal stationary solutions under the effect of approximation error.

