DOUBLE Q-LEARNING: NEW ANALYSIS AND SHARPER FINITE-TIME BOUND

Abstract

Double Q-learning (Hasselt, 2010) has achieved significant success in practice due to its effectiveness in overcoming the overestimation issue of Q-learning. However, the theoretical understanding of double Q-learning is rather limited, and the only existing finite-time analysis was recently established in Xiong et al. (2020) under a polynomial learning rate. This paper analyzes the more challenging case of a rescaled linear/constant learning rate, for which the previous method does not appear to be applicable. We develop new analytical tools that achieve an order-level better finite-time convergence rate than the previously established result. Specifically, we show that synchronous double Q-learning attains an $\epsilon$-accurate global optimum with a time complexity of $\Omega\big(\frac{\ln D}{(1-\gamma)^7\epsilon^2}\big)$, and the asynchronous algorithm attains a time complexity of $\Omega\big(\frac{L}{(1-\gamma)^7\epsilon^2}\big)$, where $D$ is the cardinality of the state-action space, $\gamma$ is the discount factor, and $L$ is a parameter related to the sampling strategy for asynchronous double Q-learning. These results improve the order-level dependence of the convergence rate on all major parameters ($\epsilon$, $1-\gamma$, $D$, $L$) provided in Xiong et al. (2020). The new analysis in this paper presents a more direct and succinct approach for characterizing the finite-time convergence rate of double Q-learning.

1. INTRODUCTION

Double Q-learning, proposed in Hasselt (2010), is a widely used model-free reinforcement learning (RL) algorithm in practice for searching for an optimal policy (Zhang et al., 2018a;b; Hessel et al., 2018). Compared to the vanilla Q-learning proposed in Watkins & Dayan (1992), double Q-learning uses two Q-estimators with their roles randomly selected at each iteration, respectively for estimating the maximum Q-function value and updating the Q-function. In this way, the overestimation of the action-value function in vanilla Q-learning can be effectively mitigated, especially when the reward is random or prone to errors (Hasselt, 2010; Hasselt et al., 2016; Xiong et al., 2020). Moreover, double Q-learning has been shown to have the desired performance in both the finite state-action setting (Hasselt, 2010) and the infinite setting (Hasselt et al., 2016), where it successfully improved the performance of deep Q-network (DQN) and thus subsequently inspired many variants (Zhang et al., 2017; Abed-alguni & Ottom, 2018).

In parallel to its empirical success in practice, the theoretical convergence properties of double Q-learning have also been explored. Its asymptotic convergence was first established in Hasselt (2010). The asymptotic mean-square error of double Q-learning was studied in Weng et al. (2020c) under the assumption that the algorithm converges to a unique optimal policy. Furthermore, in Xiong et al. (2020), the finite-time convergence rate was established for double Q-learning with a polynomial learning rate $\alpha_t = 1/t^\omega$, $\omega \in (0,1)$. Under such a choice of the learning rate, they showed that double Q-learning attains an $\epsilon$-accurate optimal Q-function at a time complexity approaching, but never reaching, $\Omega(1/\epsilon^2)$, at the cost of an asymptotically large exponent on $\frac{1}{1-\gamma}$.
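The two-estimator update described above can be sketched in tabular form as follows. This is a minimal illustration of the role-switching mechanism, not the exact pseudocode from Hasselt (2010); the table shapes and the coin-flip implementation are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(QA, QB, s, a, r, s_next, alpha, gamma):
    """One tabular double Q-learning step on the transition (s, a, r, s')."""
    if rng.random() < 0.5:
        # Update Q^A: the greedy action is selected by Q^A but evaluated by Q^B
        a_star = int(np.argmax(QA[s_next]))
        target = r + gamma * QB[s_next, a_star]
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        # Update Q^B: the greedy action is selected by Q^B but evaluated by Q^A
        b_star = int(np.argmax(QB[s_next]))
        target = r + gamma * QA[s_next, b_star]
        QB[s, a] += alpha * (target - QB[s, a])
```

Decoupling action selection from action evaluation in this way is what mitigates the overestimation bias of the single-estimator max operator in vanilla Q-learning.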
However, a polynomial learning rate typically does not offer the best possible convergence rate: it has been shown for several RL algorithms that a so-called rescaled linear learning rate (of the form $\alpha_t = \frac{a}{b+ct}$) and a constant learning rate achieve a better convergence rate (Bhandari et al., 2018; Wainwright, 2019a;b; Chen et al., 2020; Qu & Wierman, 2020). Therefore, a natural question arises: Can a rescaled linear learning rate or a constant learning rate improve the convergence rate of double Q-learning order-wise? If yes, does it also improve the dependence of the convergence rate on other important parameters of the Markov decision process (MDP), such as the discount factor and the cardinality of the state and action spaces? The answer does not follow immediately from Xiong et al. (2020), because the finite-time analysis framework there does not handle such learning rates in a way that yields a desirable result. This paper develops a novel analysis approach and provides affirmative answers to the above question.
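For concreteness, the three step-size schedules discussed here can be written out as follows. The specific constants `a`, `b`, `c`, `omega`, and `alpha` below are hypothetical illustrations, not the tuned values used in the analysis.

```python
def polynomial_lr(t, omega=0.8):
    # alpha_t = 1 / t^omega with omega in (0, 1), as in Xiong et al. (2020)
    return 1.0 / (t ** omega)

def rescaled_linear_lr(t, a=3.0, b=3.0, c=0.1):
    # alpha_t = a / (b + c * t); with c = 1 - gamma this matches the
    # schedule alpha_t = 3 / (3 + (1 - gamma) t) used for the synchronous case
    return a / (b + c * t)

def constant_lr(t, alpha=0.1):
    # a fixed step size, independent of the iteration index
    return alpha
```

The rescaled linear schedule decays like $1/t$ asymptotically but starts from a constant-order step size, which is what enables the sharper rates discussed below.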

1.1. OUR CONTRIBUTIONS

This paper establishes sharper finite-time bounds for double Q-learning with a rescaled linear/constant learning rate, which are order-wise better than the existing bounds in Xiong et al. (2020). We devise an analysis approach different from that in Xiong et al. (2020), which is more capable of handling variants of double Q-learning.

• For synchronous double Q-learning, where all state-action pairs are visited at each iteration, we apply a rescaled linear learning rate $\alpha_t = \frac{3}{3+(1-\gamma)t}$ and show that the algorithm attains an $\epsilon$-accurate global optimum with a time complexity of $\Omega\big(\frac{\ln D}{(1-\gamma)^7\epsilon^2}\big)$, where $\gamma$ is the discount factor and $D = |\mathcal{S}||\mathcal{A}|$ is the cardinality of the finite state-action space. As a comparison, for the $\epsilon$-dominated regime (with relatively small $\gamma$), our result attains an $\epsilon$-accurate optimal Q-function with a time complexity of $\Omega(1/\epsilon^2)$, whereas the result in Xiong et al. (2020) (see Table 1) does not exactly reach $\Omega(1/\epsilon^2)$, and its approach to this order (as $\eta := 1-\omega \to 0$) comes at the additional cost of an asymptotically large exponent on $\frac{1}{1-\gamma}$. For the $(1-\gamma)$-dominated regime, our result improves on that in Xiong et al. (2020) (which has been optimized in its dependence on $1-\gamma$ in Table 1) by $O\big(\big(\ln\frac{1}{1-\gamma}\big)^7\big)$.

• For asynchronous double Q-learning, where only one state-action pair is visited at each iteration, we obtain a time complexity of $\Omega\big(\frac{L}{(1-\gamma)^7\epsilon^2}\big)$, where $L$ is a parameter related to the sampling strategy in Assumption 1. As illustrated in Table 1, our result improves upon that in Xiong et al. (2020) order-wise in terms of its dependence on $\epsilon$ and $1-\gamma$, as well as on $L$ by at least $O(L^5)$.

Our analysis takes a different approach from that in Xiong et al. (2020) in order to handle the rescaled linear/constant learning rate. More specifically, to deal with a pair of nested stochastic approximation (SA) recursions, we directly establish how the error dynamics (of the outer SA) between the Q-estimator and the global optimum depend on the error propagation (of the inner SA) between the two Q-estimators. We then develop a bound on the inner SA, integrate it into the bound on the outer SA as a noise term, and establish the final convergence bound. This is a more direct approach than that in Xiong et al. (2020), which captures block-wise convergence by constructing two complicated block-wisely decreasing bounds for the two SAs. The sharpness of the bound also requires careful selection of the rescaled learning rates and proper use of their properties.
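The nested-SA structure just described can be summarized schematically as follows. The coefficients here are illustrative placeholders meant only to convey the shape of the argument, not the exact constants derived in the analysis; $x_t$ denotes the outer error and $u_t$ the inner error.

```latex
% Schematic of the two nested SA recursions (coefficients illustrative):
%   x_t = \|Q^A_t - Q^*\|_\infty  (outer error, Q-estimator vs. global optimum)
%   u_t = \|Q^B_t - Q^A_t\|_\infty (inner error between the two Q-estimators)
\begin{align*}
  x_{t+1} &\le \big(1 - (1-\gamma)\alpha_t\big)\, x_t + \gamma \alpha_t u_t + \alpha_t w_t
    && \text{(outer SA; the inner error enters as a noise term)} \\
  u_{t+1} &\le (1 - \alpha_t)\, u_t + \alpha_t v_t
    && \text{(inner SA; the two estimators contract toward each other)}
\end{align*}
```

Bounding $u_t$ first and substituting it into the outer recursion yields the final convergence bound, avoiding the block-wise construction of Xiong et al. (2020).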

1.2. RELATED WORK

Theory on double Q-learning: Double Q-learning was proposed and proved to converge asymptotically in Hasselt (2010). In Weng et al. (2020c), the authors explored the properties of the mean-square error of double Q-learning both in the tabular case and with linear function approximation, under the assumption that a unique optimal policy exists and the algorithm converges. The most relevant work to this paper is Xiong et al. (2020), which established the first finite-time convergence rate for tabular double Q-learning with a polynomial learning rate. This paper provides sharper finite-time convergence bounds for double Q-learning, which require a different analysis approach.

Tabular Q-learning and convergence under various learning rates: Proposed in Watkins & Dayan (1992) for finite state-action spaces, Q-learning has attracted great interest in its theoretical study. Its asymptotic convergence was established in Tsitsiklis (1994); Jaakkola et al. (1994);

