ON THE ESTIMATION BIAS IN DOUBLE Q-LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Double Q-learning is a classical method for reducing overestimation bias, which is caused by taking the maximum over estimated values in the Bellman operator. Its variants in the deep Q-learning paradigm have shown great promise in producing reliable value predictions and improving learning performance. However, as shown in prior work, double Q-learning is not fully unbiased and suffers from underestimation bias. In this paper, we show that such underestimation bias may lead to multiple non-optimal fixed points under an approximate Bellman operator. To address the concern of converging to non-optimal stationary solutions, we propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning. This approach leverages real returns to bound the target value. We extensively evaluate the proposed method on Atari benchmark tasks and demonstrate its significant improvement over baseline algorithms.

1. INTRODUCTION

Value-based reinforcement learning with neural networks as function approximators has become a widely used paradigm and has shown great promise in solving complicated decision-making problems in various real-world applications, including robotic control (Lillicrap et al., 2016), molecular structure design (Zhou et al., 2019), and recommendation systems (Chen et al., 2018). Towards understanding the foundation of these successes, investigating the algorithmic properties of deep-learning-based value function approximation has seen a growth of attention in recent years (Van Hasselt et al., 2018; Fu et al., 2019; Achiam et al., 2019; Dong et al., 2020). One phenomenon of interest is that Q-learning (Watkins, 1989) is known to suffer from overestimation issues, since it takes a maximum operator over estimated action-values. Compared with underestimated values, overestimation errors are more likely to be propagated through greedy action selection, which leads to an overestimation bias in value prediction (Thrun & Schwartz, 1993). This overoptimistic behavior in decision making has also been investigated in the literature of management science (Smith & Winkler, 2006) and economics (Thaler, 1988).
From a statistical perspective, the value estimation error may come from many sources, such as the stochasticity of the environment and the imperfection of the function expressivity. However, for deep Q-learning algorithms, even though most benchmark environments are nearly deterministic (Brockman et al., 2016) and millions of samples are collected, the overestimation phenomenon remains dramatic (Hasselt et al., 2016). One cause of this problematic issue is the difficulty of optimization. Although a deep neural network may have sufficient expressive power to represent an accurate value function, the back-end optimization problem is hard to solve.
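The overestimation mechanism described above can be illustrated with a minimal simulation. In this sketch (all names and numbers are our own illustrative choices, not from the paper), every action has the same true value of zero, but each estimate is perturbed by zero-mean noise; taking the maximum over the noisy estimates then yields a systematically positive value prediction:

```python
import random

# Every action has true value q*(s, a) = 0, but each estimate carries
# zero-mean Gaussian noise. max over noisy estimates is positively biased.
random.seed(0)

n_actions = 10
noise_scale = 1.0
n_trials = 10_000

bias_sum = 0.0
for _ in range(n_trials):
    estimates = [random.gauss(0.0, noise_scale) for _ in range(n_actions)]
    bias_sum += max(estimates)  # the true maximum is 0, so any excess is bias

avg_bias = bias_sum / n_trials
print(f"average overestimation bias: {avg_bias:.3f}")  # clearly positive
```

Even though every individual estimate is unbiased, the maximum over them is not; with 10 actions and unit noise the average bias is well above zero, matching the argument of Thrun & Schwartz (1993).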
Owing to computational considerations, stochastic gradient descent is almost the default choice for deep reinforcement learning algorithms. The high variance of such stochastic methods in gradient estimation leads to an unavoidable approximation error in value prediction. This kind of approximation error cannot be removed simply by increasing the sample size or network capacity, and it is a major source of overestimation bias. Double Q-learning is a classical method for reducing the risk of overestimation; it is a specific variant of the double estimator (Stone, 1974) in the Q-learning paradigm. It uses a second value function to construct an independent action-value evaluation, serving as a form of cross-validation. Under proper assumptions, double Q-learning was proved to underestimate rather than overestimate the maximum expected value (Van Hasselt, 2010). In practice, however, obtaining two independent value estimators is usually intractable in large-scale tasks, so double Q-learning can still suffer from overestimation in some situations. To address these empirical concerns, Fujimoto et al. (2018) proposed a variant named clipped double Q-learning, which takes the minimum over two value estimates.
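The clipped double Q-learning target mentioned above can be sketched as follows. This is a minimal tabular illustration under our own assumptions (dictionary-based Q-tables, our own function and variable names), not the paper's implementation:

```python
# Sketch of the clipped double Q-learning target (Fujimoto et al., 2018):
# target = r + gamma * min_i Q_i(s', a*), where a* = argmax_a Q_1(s', a).
def clipped_double_q_target(q1, q2, s_next, reward, gamma, done):
    if done:
        return reward
    # Select the greedy action under the first estimator...
    a_star = max(range(len(q1[s_next])), key=lambda a: q1[s_next][a])
    # ...but evaluate it with the minimum of both estimators. Taking the
    # minimum counteracts overestimation, at the cost of a possible
    # underestimation bias.
    return reward + gamma * min(q1[s_next][a_star], q2[s_next][a_star])

# Tiny usage example with two states and two actions.
q1 = {0: [1.0, 2.0], 1: [0.5, 0.0]}
q2 = {0: [1.5, 1.0], 1: [0.4, 0.2]}
target = clipped_double_q_target(q1, q2, s_next=0, reward=1.0, gamma=0.9, done=False)
print(target)  # 1.0 + 0.9 * min(2.0, 1.0) = 1.9
```

Note how the greedy action (action 1, with q1 value 2.0) is evaluated at only 1.0 under q2; the minimum pulls the target below what either single estimator would produce with its own greedy evaluation, which is exactly the underestimation behavior the paper analyzes.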

