DOUBLE Q-LEARNING: NEW ANALYSIS AND SHARPER FINITE-TIME BOUND

Abstract

Double Q-learning (Hasselt, 2010) has gained significant success in practice due to its effectiveness in overcoming the overestimation issue of Q-learning. However, theoretical understanding of double Q-learning is rather limited, and the only existing finite-time analysis was recently established in Xiong et al. (2020) under a polynomial learning rate. This paper analyzes the more challenging case of a rescaled linear/constant learning rate, for which the previous method does not appear to be applicable. We develop new analytical tools that achieve an order-level better finite-time convergence rate than the previously established result. Specifically, we show that synchronous double Q-learning attains an ε-accurate global optimum with a time complexity of Ω(ln D / ((1-γ)^7 ε^2)), and the asynchronous algorithm attains a time complexity of Ω(L / ((1-γ)^7 ε^2)), where D is the cardinality of the state-action space, γ is the discount factor, and L is a parameter related to the sampling strategy for asynchronous double Q-learning. These results improve the order-level dependence of the convergence rate on all major parameters (ε, 1-γ, D, L) provided in Xiong et al. (2020). The new analysis in this paper presents a more direct and succinct approach for characterizing the finite-time convergence rate of double Q-learning.

1. INTRODUCTION

Double Q-learning, proposed in Hasselt (2010), is a model-free reinforcement learning (RL) algorithm widely used in practice for searching for an optimal policy (Zhang et al., 2018a;b; Hessel et al., 2018). Compared to vanilla Q-learning, proposed in Watkins & Dayan (1992), double Q-learning maintains two Q-estimators whose roles are randomly selected at each iteration: one estimates the maximum Q-function value while the other updates the Q-function. In this way, the overestimation of the action-value function in vanilla Q-learning can be effectively mitigated, especially when the reward is random or prone to errors (Hasselt, 2010; Hasselt et al., 2016; Xiong et al., 2020). Moreover, double Q-learning has been shown to perform well in both the finite state-action setting (Hasselt, 2010) and the infinite setting (Hasselt et al., 2016), where it successfully improved the performance of the deep Q-network (DQN), and it subsequently inspired many variants (Zhang et al., 2017; Abed-alguni & Ottom, 2018).

In parallel to its empirical success in practice, the theoretical convergence properties of double Q-learning have also been explored. Its asymptotic convergence was first established in Hasselt (2010). The asymptotic mean-square error of double Q-learning was studied in Weng et al. (2020c) under the assumption that the algorithm converges to a unique optimal policy. Furthermore, in Xiong et al. (2020), a finite-time convergence rate was established for double Q-learning with a polynomial learning rate α_t = 1/t^ω, ω ∈ (0, 1). Under such a choice of learning rate, they showed that double Q-learning attains an ε-accurate optimal Q-function at a time complexity approaching, but never reaching, Ω(1/ε^2), at the cost of an asymptotically large exponent on 1/(1-γ).
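The role-switching update described above can be sketched as follows. This is a minimal tabular illustration, not the paper's analyzed variant: the function name, the dictionary-based Q-tables, and the environment interface are all assumptions introduced for exposition.

```python
import random
from collections import defaultdict

def double_q_update(qa, qb, s, a, r, s_next, alpha, gamma, actions):
    """One tabular double Q-learning update in the spirit of Hasselt (2010).

    With probability 1/2, estimator A is updated using estimator B to
    evaluate the greedy action selected by A, and vice versa. Decoupling
    action selection from action evaluation is what mitigates the
    overestimation bias of vanilla Q-learning.
    """
    if random.random() < 0.5:
        learner, evaluator = qa, qb
    else:
        learner, evaluator = qb, qa
    # Greedy action under the learner, evaluated by the *other* estimator.
    a_star = max(actions, key=lambda act: learner[(s_next, act)])
    target = r + gamma * evaluator[(s_next, a_star)]
    learner[(s, a)] += alpha * (target - learner[(s, a)])

# Hypothetical usage with zero-initialized Q-tables:
qa, qb = defaultdict(float), defaultdict(float)
double_q_update(qa, qb, "s0", "a0", 1.0, "s1",
                alpha=0.5, gamma=0.9, actions=["a0", "a1"])
```

Since both tables start at zero, exactly one of the two estimators receives the increment alpha * r = 0.5 at (s0, a0), whichever was randomly chosen as the learner.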
However, a polynomial learning rate typically does not offer the best possible convergence rate: for other RL algorithms, it has been shown that a so-called rescaled linear learning rate (of the form α_t = a/(b+ct)) and a constant learning rate achieve better convergence rates (Bhandari et al., 2018; Wainwright, 2019a;b; Chen et al., 2020; Qu & Wierman, 2020). Therefore, a natural question arises as follows:

