DOUBLE Q-LEARNING: NEW ANALYSIS AND SHARPER FINITE-TIME BOUND

Abstract

Double Q-learning (Hasselt, 2010) has achieved significant success in practice due to its effectiveness in overcoming the overestimation issue of Q-learning. However, the theoretical understanding of double Q-learning is rather limited, and the only existing finite-time analysis was recently established in Xiong et al. (2020) under a polynomial learning rate. This paper analyzes the more challenging case of a rescaled linear/constant learning rate, to which the previous method does not appear to be applicable. We develop new analytical tools that achieve an order-level better finite-time convergence rate than the previously established result. Specifically, we show that synchronous double Q-learning attains an ε-accurate global optimum with a time complexity of Ω(ln D/((1-γ)^7 ε^2)), and the asynchronous algorithm attains a time complexity of Ω(L/((1-γ)^7 ε^2)), where D is the cardinality of the state-action space, γ is the discount factor, and L is a parameter related to the sampling strategy for asynchronous double Q-learning. These results improve the order-level dependence of the convergence rate on all major parameters (ε, 1-γ, D, L) over those provided in Xiong et al. (2020). The new analysis in this paper presents a more direct and succinct approach for characterizing the finite-time convergence rate of double Q-learning.

1. INTRODUCTION

Double Q-learning, proposed in Hasselt (2010), is a widely used model-free reinforcement learning (RL) algorithm for searching for an optimal policy in practice (Zhang et al., 2018a;b; Hessel et al., 2018). Compared to the vanilla Q-learning proposed in Watkins & Dayan (1992), double Q-learning uses two Q-estimators whose roles are randomly selected at each iteration, respectively for estimating the maximum Q-function value and for updating the Q-function. In this way, the overestimation of the action-value function in vanilla Q-learning can be effectively mitigated, especially when the reward is random or prone to errors (Hasselt, 2010; Hasselt et al., 2016; Xiong et al., 2020). Moreover, double Q-learning has been shown to attain the desired performance in both the finite state-action setting (Hasselt, 2010) and the infinite setting (Hasselt et al., 2016), where it successfully improved the performance of the deep Q-network (DQN) and thus inspired many subsequent variants (Zhang et al., 2017; Abed-alguni & Ottom, 2018). In parallel to its empirical success, the theoretical convergence properties of double Q-learning have also been explored. Its asymptotic convergence was first established in Hasselt (2010). The asymptotic mean-square error of double Q-learning was studied in Weng et al. (2020c) under the assumption that the algorithm converges to a unique optimal policy. Furthermore, in Xiong et al. (2020), the finite-time convergence rate was established for double Q-learning with a polynomial learning rate α_t = 1/t^ω, ω ∈ (0, 1). Under such a choice of the learning rate, they showed that double Q-learning attains an ε-accurate optimal Q-function at a time complexity approaching, but never reaching, Ω(1/ε^2), at the cost of an asymptotically large exponent on 1/(1-γ).
However, a polynomial learning rate typically does not offer the best possible convergence rate: it has been shown for RL algorithms that a so-called rescaled linear learning rate (of the form α_t = a/(b + ct)) and a constant learning rate achieve a better convergence rate (Bhandari et al., 2018; Wainwright, 2019a;b; Chen et al., 2020; Qu & Wierman, 2020). Therefore, a natural question arises: Can a rescaled linear learning rate or a constant learning rate improve the convergence rate of double Q-learning order-wise? If yes, does it also improve the dependence of the convergence rate on other important parameters of the Markov decision process (MDP), such as the discount factor and the cardinality of the state and action spaces? The answer to the above question does not follow immediately from Xiong et al. (2020), because the finite-time analysis framework in Xiong et al. (2020) does not handle such learning rates in a way that yields a desirable result. This paper develops a novel analysis approach and provides affirmative answers to the above question.
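As a concrete illustration, the three learning-rate families discussed above can be sketched as follows. The specific constants here are illustrative only; the rescaled linear schedule with a = b = 3 and c = 1-γ recovers the choice analyzed later in this paper.

```python
def polynomial_lr(t, omega=0.85):
    """Polynomial learning rate alpha_t = 1 / t^omega with omega in (0, 1)."""
    return 1.0 / t ** omega

def rescaled_linear_lr(t, a=3.0, b=3.0, c=0.1):
    """Rescaled linear learning rate alpha_t = a / (b + c * t); c = 1 - gamma in this paper."""
    return a / (b + c * t)

def constant_lr(t, alpha=0.05):
    """Constant learning rate, independent of t."""
    return alpha

# The rescaled linear rate starts at an O(1) value and decays at a Theta(1/t) rate,
# which is asymptotically the fastest decay among the three families.
for t in (1, 10, 100, 1000, 10000):
    print(t, polynomial_lr(t), rescaled_linear_lr(t), constant_lr(t))
```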

1.1. OUR CONTRIBUTIONS

This paper establishes sharper finite-time bounds for double Q-learning with a rescaled linear/constant learning rate, which are order-wise better than the existing bounds in Xiong et al. (2020). We devise a different analysis approach from that in Xiong et al. (2020), which is more capable of handling variants of double Q-learning.

• For synchronous double Q-learning, where all state-action pairs are visited at each iteration, we apply a rescaled linear learning rate α_t = 3/(3 + (1-γ)t) and show that the algorithm attains an ε-accurate global optimum with a time complexity of Ω(ln D/((1-γ)^7 ε^2)), where γ is the discount factor and D = |S||A| is the cardinality of the finite state-action space. As a comparison, in the ε-dominated regime (with relatively small γ), our result attains an ε-accurate optimal Q-function with a time complexity of Ω(1/ε^2), whereas the result in Xiong et al. (2020) (see Table 1) does not exactly reach Ω(1/ε^2), and its approach to such an order (η := 1 - ω → 0) comes at the additional cost of an asymptotically large exponent on 1/(1-γ). In the (1-γ)-dominated regime, our result improves on that in Xiong et al. (2020) (optimized in its dependence on 1-γ in Table 1) by O((ln(1/(1-γ)))^7).

• For asynchronous double Q-learning, where only one state-action pair is visited at each iteration, we obtain a time complexity of Ω(L/((1-γ)^7 ε^2)), where L is a parameter related to the sampling strategy in Assumption 1. As illustrated in Table 1, our result improves upon that in Xiong et al. (2020) order-wise in terms of its dependence on ε and 1-γ, as well as on L by at least O(L^5).

Our analysis takes a different approach from that in Xiong et al. (2020) in order to handle the rescaled linear/constant learning rate. More specifically, to deal with a pair of nested stochastic approximation (SA) recursions, we directly bound the error dynamics (of the outer SA) between the Q-estimator and the global optimum in terms of the error propagation (of the inner SA) between the two Q-estimators. We then develop a bound on the inner SA, integrate it into the outer SA as a noise term, and establish the final convergence bound. This is a very different yet more direct approach than that in Xiong et al. (2020), which captures block-wise convergence by constructing two complicated block-wisely decreasing bounds for the two SAs. The sharpness of the bound also requires careful selection of the rescaled learning rates and proper usage of their properties.

1.2. RELATED WORK

Theory on double Q-learning: Double Q-learning was proposed and proved to converge asymptotically in Hasselt (2010). In Weng et al. (2020c), the authors explored the properties of the mean-square error of double Q-learning both in the tabular case and with linear function approximation, under the assumption that a unique optimal policy exists and the algorithm converges. The most relevant work to this paper is Xiong et al. (2020), which established the first finite-time convergence rate for tabular double Q-learning with a polynomial learning rate. This paper provides sharper finite-time convergence bounds for double Q-learning, which require a different analysis approach.

Table 1: Comparison of time complexity for (a)synchronous double Q-learning. The choices ω → 1, ω = 6/7, and ω = 2/3 respectively optimize the dependence of the time complexity on ε, 1-γ, and L in Xiong et al. (2020). We denote a ∨ b = max{a, b} and a ∧ b = min{a, b}.

SyncDQ | Stepsize | Time complexity
Xiong et al. (2020) | 1/t^ω, ω ∈ (1/3, 1) | ω = 1-η → 1: Ω(1/ε^{2+η} ∨ (1/(1-γ))^{1/η}); ω = 6/7: Ω(1/((1-γ)^7 ε^{3.5}) ∨ (ln(1/(1-γ)))^7)
This work | 3/(3 + (1-γ)t) | Ω(1/ε^2); Ω(1/((1-γ)^7 ε^2))

AsyncDQ | Stepsize | Time complexity
Xiong et al. (2020) | 1/t^ω, ω ∈ (1/3, 1) | ω = 1-η → 1: Ω(1/ε^{2+η} ∨ (1/(1-γ))^{1/η}); ω = 6/7: Ω(1/((1-γ)^7 ε^{3.5}) ∨ (ln(1/(1-γ)))^7); ω = 2/3: Ω(L^6 (ln L)^{1.5}/((1-γ)^9 ε^3))
This work | (1-γ)^6 ε^2 ∧ 1 | Ω(1/ε^2); Ω(1/((1-γ)^7 ε^2)); Ω(L/((1-γ)^7 ε^2))

Tabular Q-learning and convergence under various learning rates: Proposed in Watkins & Dayan (1992) for finite state-action spaces, Q-learning has aroused great interest in its theoretical study. Its asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al. (1994); Borkar & Meyn (2000); Melo (2001); Lee & He (2019) by requiring the learning rates to satisfy Σ_{t=0}^∞ α_t = ∞ and Σ_{t=0}^∞ α_t^2 < ∞. Another line of research focuses on the finite-time analysis of Q-learning under different choices of the learning rate. Szepesvári (1998) captured the first convergence rate of Q-learning using a linear learning rate (i.e., α_t = 1/t). Under similar learning rates, Even-Dar & Mansour (2003) provided finite-time results for both synchronous and asynchronous Q-learning, with a convergence rate that is exponentially slow as a function of 1/(1-γ). Another popular choice is the polynomial learning rate, which has been studied for synchronous Q-learning in Wainwright (2019b) and for both synchronous and asynchronous Q-learning in Even-Dar & Mansour (2003). With this learning rate, however, the convergence rate still has a gap to the lower bound of O(1/√T) (Azar et al., 2013). To handle this, a more sophisticated rescaled linear learning rate was introduced for synchronous Q-learning (Wainwright, 2019b; Chen et al., 2020) and asynchronous Q-learning (Qu & Wierman, 2020), which yields a better convergence rate. Finite-time bounds for Q-learning were also given with constant stepsizes (Beck & Srikant, 2012; Chen et al., 2020; Li et al., 2020). In this paper, we focus on the rescaled linear/constant learning rate and obtain sharper finite-time bounds for double Q-learning.
Q-learning with function approximation: When the state-action space is considerably large or even infinite, the Q-function is usually approximated by a class of parameterized functions. In such a case, Q-learning has been shown not to converge in general (Baird, 1995) . Strong assumptions are typically needed to establish the convergence of Q-learning with linear function approximation (Bertsekas & Tsitsiklis, 1996; Melo et al., 2008; Zou et al., 2019; Chen et al., 2019; Du et al., 2019; Yang & Wang, 2019; Jia et al., 2019; Weng et al., 2020a; b) or neural network approximation (Cai et al., 2019; Xu & Gu, 2019) . The convergence analysis of double Q-learning with function approximation raises new technical challenges and can be an interesting topic for future study.

2. PRELIMINARIES ON DOUBLE Q-LEARNING

We consider a Markov decision process (MDP) over a finite state space S and a finite action space A with total cardinality D := |S||A|. The transition kernel of the MDP is given by P : S × A × S → [0, 1], denoted as P(·|s, a). We denote the random reward function at time t as R_t : S × A × S → [0, R_max], with E[R_t(s, a, s')] = R_{sa}^{s'}. A policy π := π(·|s) captures the conditional probability distribution over the action space given state s ∈ S. For a policy π, we define the Q-function Q^π ∈ R^{|S|×|A|} as

Q^π(s, a) := E[ Σ_{t=1}^∞ γ^t R_t(s_t, a_t, s'_t) | s_1 = s, a_1 = a ],   (1)

where γ ∈ (0, 1) is the discount factor, a_t ∼ π(·|s_t), and s'_t ∼ P(·|s_t, a_t). Both vanilla Q-learning (Watkins & Dayan, 1992) and double Q-learning (Hasselt, 2010) aim to find the optimal Q-function Q*, which is the unique fixed point of the Bellman operator T (Bertsekas & Tsitsiklis, 1996) given by

T Q(s, a) = E_{s' ∼ P(·|s,a)}[ R_{sa}^{s'} + γ max_{a' ∈ A} Q(s', a') ].   (2)

Note that the Bellman operator T is γ-contractive, i.e., it satisfies ||T Q - T Q'|| ≤ γ ||Q - Q'|| under the supremum norm ||Q|| := max_{s,a} |Q(s, a)|. The idea of double Q-learning is to keep two Q-tables (i.e., Q-function estimators) Q^A and Q^B, and at each iteration randomly choose one Q-table to update based on the Bellman operator computed from the other Q-table. We next describe the synchronous and asynchronous double Q-learning algorithms in more detail.

Synchronous double Q-learning: Let {β_t}_{t≥1} be a sequence of i.i.d. Bernoulli random variables satisfying P(β_t = 0) = P(β_t = 1) = 0.5. At each time t, β_t = 0 indicates that Q^B is updated, and otherwise Q^A is updated.
The update at time t ≥ 1 can be written in the compact form

Q^A_{t+1}(s, a) = (1 - α_t β_t) Q^A_t(s, a) + α_t β_t ( R_t(s, a, s') + γ Q^B_t(s', a*) ),
Q^B_{t+1}(s, a) = (1 - α_t (1 - β_t)) Q^B_t(s, a) + α_t (1 - β_t) ( R_t(s, a, s') + γ Q^A_t(s', b*) ),   (3)

for all (s, a) ∈ S × A, where s' is sampled independently for each (s, a) by s' ∼ P(·|s, a), a* = argmax_{a' ∈ A} Q^A_t(s', a'), b* = argmax_{a' ∈ A} Q^B_t(s', a'), and α_t is the learning rate. Note that the rewards for the updates of Q^A_{t+1} and Q^B_{t+1} are the same copy of R_t.

Asynchronous double Q-learning: Different from synchronous double Q-learning, the asynchronous version samples only one state-action pair at each iteration to update the chosen Q-estimator. That is, at time t, only the chosen Q-estimator and its value at the sampled state-action pair (s_t, a_t) are updated. We model this by introducing the indicator function τ_t(s, a) = 1{(s_t, a_t) = (s, a)}. Then the update at time t ≥ 1 of asynchronous double Q-learning can be written compactly as

Q^A_{t+1}(s, a) = (1 - α_t τ_t(s, a) β_t) Q^A_t(s, a) + α_t τ_t(s, a) β_t ( R_t + γ Q^B_t(s', a*) ),
Q^B_{t+1}(s, a) = (1 - α_t τ_t(s, a)(1 - β_t)) Q^B_t(s, a) + α_t τ_t(s, a)(1 - β_t) ( R_t + γ Q^A_t(s', b*) ),   (4)

for all (s, a) ∈ S × A, where R_t is evaluated as R_t(s, a, s'). In the above update rules (3) and (4), at each iteration only one of the two Q-tables is randomly chosen to be updated. This chosen Q-table generates a greedy optimal action, and the other Q-table is used for estimating the corresponding Bellman operator (or evaluating the greedy action) for updating the chosen table. Specifically, if Q^A is chosen to be updated, we use Q^A to obtain the optimal action a* and then estimate the corresponding Bellman operator using Q^B to update Q^A. As shown in Hasselt (2010), E[Q^B(s', a*)] is likely smaller than E[max_a Q^A(s', a)], where the expectation is taken over the randomness of the reward for the same (s, a, s') tuple.
Such a two-estimator framework adopted by double Q-learning can effectively reduce the overestimation. Without loss of generality, we assume that Q^A and Q^B are initialized with the same value (usually both all-zero tables in practice). For both synchronous and asynchronous double Q-learning, it has been shown in Xiong et al. (2020) that either Q-estimator is uniformly bounded by R_max/(1-γ) throughout the learning process. Specifically, for either i ∈ {A, B}, we have ||Q^i_t|| ≤ R_max/(1-γ) and ||Q^i_t - Q*|| ≤ 2R_max/(1-γ) := V_max for all t ≥ 1. This boundedness property will be useful in our finite-time analysis.
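To make the updates concrete, the following is a minimal sketch of tabular synchronous double Q-learning as described above, run on a small randomly generated MDP. The MDP, R_max = 1, the horizon, and the seed are hypothetical choices for illustration only, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (illustrative only), with rewards in [0, 1].
nS, nA, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = distribution over next states
R = rng.uniform(0.0, 1.0, size=(nS, nA, nS))    # deterministic reward table here

QA = np.zeros((nS, nA))
QB = np.zeros((nS, nA))

for t in range(1, 10001):
    alpha = 3.0 / (3.0 + (1.0 - gamma) * t)     # rescaled linear learning rate
    beta = rng.integers(0, 2)                   # fair coin: which table to update
    for s in range(nS):
        for a in range(nA):
            s2 = rng.choice(nS, p=P[s, a])      # independent next-state sample per (s, a)
            r = R[s, a, s2]
            if beta:                            # update Q^A using Q^B's evaluation
                a_star = int(np.argmax(QA[s2]))
                QA[s, a] += alpha * (r + gamma * QB[s2, a_star] - QA[s, a])
            else:                               # update Q^B using Q^A's evaluation
                b_star = int(np.argmax(QB[s2]))
                QB[s, a] += alpha * (r + gamma * QA[s2, b_star] - QB[s, a])

# Both estimators stay within R_max / (1 - gamma), and their gap shrinks over time.
print(np.max(np.abs(QA)), np.max(np.abs(QA - QB)))
```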

3. FINITE-TIME CONVERGENCE ANALYSIS

In this section, we start by modeling the error dynamics as a pair of nested SAs, followed by a convergence result for a general SA that is applicable to both. We then provide the finite-time results for both synchronous and asynchronous double Q-learning. Finally, we sketch the proof of the main theorem for the synchronous algorithm to help explain the technical approach.

3.1. CHARACTERIZATION OF THE ERROR DYNAMICS

In this subsection, we characterize the (a)synchronous double Q-learning algorithms as a pair of nested SA recursions, where the outer SA recursion captures the error dynamics between the Q-estimator and the global optimum Q*, and the inner SA captures the error propagation between the two Q-estimators, which enters the outer SA as a noise term. Such a characterization enjoys useful properties that facilitate the finite-time analysis.

Outer SA: Denote the iteration error by r_t = Q^A_t - Q* and define the empirical Bellman operator T̂_t Q(s, a) := R_t(s, a, s') + γ max_{a' ∈ A} Q(s', a'). Then we have for all t ≥ 1 (see Appendix C)

r_{t+1}(s, a) = (1 - α̂_t(s, a)) r_t(s, a) + α̂_t(s, a) ( G_t(r_t)(s, a) + ε_t(s, a) + γ ν_t(s', a*) ),   (5)

where ε_t := T̂_t Q* - Q*, ν_t := Q^B_t - Q^A_t, G_t(r_t) := T̂_t Q^A_t - T̂_t Q* = T̂_t(r_t + Q*) - T̂_t Q*, and the equivalent learning rate is given by α̂_t(s, a) := α_t β_t for the synchronous version and α̂_t(s, a) := α_t β_t τ_t(s, a) for the asynchronous version. Note that it is by design that we use the same sampled reward R_t in both T̂_t Q* and T̂_t Q^A_t in the definition of G_t(r_t). These newly introduced variables have several important properties. First of all, the noise term {ε_t}_t is a sequence of i.i.d. random variables satisfying E ε_t = E[T̂_t Q*] - Q* = T Q* - Q* = 0 ∈ R^D. Furthermore, define the span seminorm of Q* as ||Q*||_span := max_{(s,a) ∈ S×A} Q*(s, a) - min_{(s,a) ∈ S×A} Q*(s, a). Then ε_t is uniformly bounded:

||ε_t|| ≤ 2R_max + γ ||Q*||_span := κ.   (6)

Moreover, it is easy to show that ||G_t(r_t)|| ≤ γ ||r_t||, which follows from the contractive property of the empirical Bellman operator given the same next state. We shall say that G_t is quasi-contractive in the sense that the γ-contraction inequality only holds with respect to the origin 0.
Inner SA: We further characterize the dynamics of ν_t = Q^B_t - Q^A_t as an SA recursion (see Appendix C):

ν_{t+1}(s, a) = (1 - α̂_t(s, a)) ν_t(s, a) + α̂_t(s, a) ( H_t(ν_t)(s, a) + μ_t(s, a) ),   (7)

where H_t is quasi-contractive with parameter (1+γ)/2, i.e., ||H_t(ν_t)|| ≤ ((1+γ)/2) ||ν_t||, and {μ_t}_{t≥1} is a martingale difference sequence with respect to the filtration F_t defined by F_1 = {∅, Ω}, where Ω denotes the underlying probability space, and for t ≥ 2,

F_t = σ({s'_k}, {R_{k-1}}, β_{k-1}, 2 ≤ k ≤ t) for the synchronous version, and
F_t = σ(s_k, a_k, R_{k-1}, β_{k-1}, 2 ≤ k ≤ t) for the asynchronous version.

Here, for synchronous sampling, {s'_k} and {R_{k-1}} are the collections of sampled next states and sampled rewards for each (s, a)-pair, respectively, while for asynchronous sampling, the pairs {(s_k, a_k, s_{k+1})}_{k≥2} are consecutive sample transitions from one observed trajectory. In the sequel, we provide the finite-time convergence guarantee for (a)synchronous double Q-learning using the SA recursions (5) and (7).

3.2. FINITE-TIME BOUND FOR A GENERAL SA

In this subsection, we develop a convergence result for a general SA that is applicable to both the inner and outer SAs described in Section 3.1. Consider the following general SA algorithm with the unique fixed point θ* = 0:

θ_{t+1} = (1 - α_t) θ_t + α_t ( G_t(θ_t) + ε_t + γ ν_t ), for all t ≥ 1,   (9)

where θ_t ∈ R^n and we abuse the notation of a general learning rate α_t ∈ [0, 1). We bound ||θ_t|| in the following proposition, the proof of which is provided in Appendix D.

Proposition 1. Consider the SA given in (9). Suppose G_t is quasi-contractive with a constant parameter γ, that is, ||G_t(θ_t)|| ≤ γ ||θ_t|| where γ ∈ (0, 1). Then for any learning rate α_t ∈ [0, 1), the iterates {θ_t} satisfy

||θ_t|| ≤ Π_{k=1}^{t-1} (1 - (1-γ) α_k) ||θ_1|| + γ α_{t-1} ( ||W_{t-1}|| + ||ν_{t-1}|| ) + γ Σ_{k=1}^{t-2} Π_{l=k+1}^{t-1} (1 - (1-γ) α_l) α_k ( ||W_k|| + ||ν_k|| ) + ||W_t||,   (10)

where the sequence {W_t} is given by W_{t+1} = (1 - α_t) W_t + α_t ε_t with W_1 = 0.

We note that an SA with a similar form to (9) has been analyzed in Wainwright (2019b), which additionally requires a monotonicity assumption. In contrast, our analysis does not require this assumption. Moreover, distinct from Wainwright (2019b), we treat the noise terms ε_t and ν_t separately rather than bounding them together. This is because for double Q-learning, the noise term ν_t has its own dynamics, which is significantly more complex than the i.i.d. noise ε_t; bounding them as one noise term would yield more conservative results. Note that the SA recursion (7) is a special case of (9) with ν_t = 0. Therefore, Proposition 1 is readily applicable to both (5) and (7).
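As a quick numerical illustration of the recursion (9), the following sketch simulates it with a hypothetical linear quasi-contractive map G_t(θ) = γθ (so ||G_t(θ)|| ≤ γ||θ|| trivially), i.i.d. bounded zero-mean noise ε_t, and ν_t = 0 (the inner-SA special case), and observes that the sup norm of the iterate decays. The dimension, horizon, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance of theta_{t+1} = (1 - a_t) theta_t + a_t (G(theta_t) + eps_t) with
# G(x) = gamma * x and nu_t = 0, under the rescaled linear learning rate.
n, gamma, T = 5, 0.8, 50000
theta = np.ones(n)
norms = []
for t in range(1, T + 1):
    a = 3.0 / (3.0 + (1.0 - gamma) * t)
    eps = rng.uniform(-1.0, 1.0, size=n)     # i.i.d. bounded zero-mean noise
    theta = (1.0 - a) * theta + a * (gamma * theta + eps)
    norms.append(float(np.max(np.abs(theta))))

# The sup norm decays toward 0; the residual fluctuation shrinks like sqrt(a_t).
print(norms[99], norms[-1])
```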

3.3. FINITE-TIME ANALYSIS OF SYNCHRONOUS DOUBLE Q-LEARNING

We apply the above bound for the general SA to synchronous double Q-learning and bound the error r_t = Q^A_t - Q*. The first result is stated in the following theorem.

Theorem 1. Fix γ ∈ (0, 1). Consider synchronous double Q-learning in (3) with the rescaled linear learning rate α_t = 3/(3 + (1-γ)t) for all t ≥ 1. Then the learning error r_t = Q^A_t - Q* satisfies

E||r_{t+1}|| ≤ 3||r_1||/((1-γ)t) + (3√3 κ C̃/(1-γ)^{3/2}) · (1/√t) + (36√3 V_max D̃/(1-γ)^{5/2}) · (1/√t),   (11)

where C̃ := 6√(ln 2D) + 3√π, D̃ := 2√(ln 2D) + √π, and κ is defined in (6) as the uniform bound of ||ε_t||.

Theorem 1 provides the finite-time error bound for synchronous double Q-learning. To understand Theorem 1, the first term on the right-hand side (RHS) of (11) shows that the initial error decays sub-linearly with respect to the number of iterations. The second term arises due to the fluctuation of the noise term ε_t, which involves the problem-specific quantity κ. The last term arises due to the fluctuation of the noise term μ_t in the ν_t-recursion (7), i.e., the difference between the two Q-estimators.

Corollary 1. The time complexity (i.e., the total number of iterations) to achieve an ε-accurate optimal Q-function (i.e., E||r_T|| ≤ ε) is given by T(ε, γ, D) = Ω(ln D/((1-γ)^7 ε^2)).

Proof. The proof follows directly from Theorem 1 by noting that the middle term on the RHS of (11) scales as (1/(1-γ))^{5/2}, since κ = 2R_max + γ||Q*||_span ≤ 2R_max/(1-γ) = V_max.

We next compare Corollary 1 with the time complexity of synchronous double Q-learning provided in Xiong et al. (2020), which is given by

T = Ω( ( 1/((1-γ)^6 ε^2) · ln( D/((1-γ)^7 ε^2) ) )^{1/ω} + ( 1/(1-γ) · ln( 1/((1-γ)^2 ε) ) )^{1/(1-ω)} ),   (12)

where ω ∈ (1/3, 1). For the ε-dominated regime (with relatively small γ), the result in (12) clearly cannot achieve the order of 1/ε^2 and ln D as our result does. Further, its approach to such an order (η → 0 in Table 1) also comes at the additional cost of an asymptotically large exponent on 1/(1-γ).
For the (1-γ)-dominated regime, the dependence on 1-γ in (12) can be optimized by taking ω = 6/7, compared to which our result achieves an improvement by a factor of O((ln(1/(1-γ)))^7) (see Table 1).
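As a back-of-envelope illustration of Corollary 1, the following evaluates the order ln D/((1-γ)^7 ε^2) with all constants dropped; the resulting numbers only illustrate how sharply the (1-γ)^7 factor dominates as γ → 1, and are not actual iteration counts.

```python
import math

def sync_complexity_order(eps, gamma, D):
    """Order of the synchronous time complexity from Corollary 1, constants dropped."""
    return math.log(D) / ((1.0 - gamma) ** 7 * eps ** 2)

# Moving gamma from 0.9 to 0.99 inflates the order by (0.1 / 0.01)^7 = 10^7.
for gamma in (0.9, 0.99):
    print(gamma, sync_complexity_order(0.1, gamma, D=100))
```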

3.4. FINITE-TIME ANALYSIS OF ASYNCHRONOUS DOUBLE Q-LEARNING

In this subsection, we provide the finite-time result for asynchronous double Q-learning. Differently from the synchronous version, at each iteration asynchronous double Q-learning updates only one state-action pair of a randomly chosen Q-estimator. Thus the sampling strategy is important for the convergence analysis, for which we first make the following assumption.

Assumption 1. The Markov chain induced by the stationary behavior policy π is uniformly ergodic.

This is a standard assumption under which Markov chains are most widely studied (Paulin, 2015). It was also assumed in Qu & Wierman (2020); Li et al. (2020) for the asynchronous samples in Q-learning. We further introduce the following standard notations (see, for example, Qu & Wierman (2020); Li et al. (2020)) that will be useful in the analysis. First, we denote by μ_π the stationary distribution of the behavior policy over the state-action space S × A and denote μ_min := min_{(s,a) ∈ S×A} μ_π(s, a). It is easy to see that the smaller μ_min is, the more iterations we need to visit all state-action pairs. Formally, we capture this probabilistic coverage by defining the following covering number:

L = min{ t : min_{(s_1,a_1) ∈ S×A} P(B_t | (s_1, a_1)) ≥ 1/2 },   (13)

where B_t denotes the event that all state-action pairs have been visited at least once in t iterations. In addition, the ergodicity assumption indicates that the distribution of the samples approaches the stationary distribution μ_π at a so-called mixing rate. We define the corresponding mixing time as

t_mix = min{ t : max_{(s_1,a_1) ∈ S×A} d_TV( P^t(·|(s_1, a_1)), μ_π ) ≤ 1/4 },   (14)

where P^t(·|(s_1, a_1)) is the distribution of (s_t, a_t) given the initial pair (s_1, a_1), and d_TV(μ, ν) is the total variation distance between two distributions μ and ν. Next, we provide the first result for asynchronous double Q-learning in the following theorem, whose proof is provided in Appendix H.

Theorem 2.
Fix γ ∈ (0, 1), δ ∈ (0, 1), ε ∈ (0, 1/(1-γ)), and suppose that Assumption 1 holds. Consider asynchronous double Q-learning with a constant learning rate α_t = α = (c_1/ln(DT/δ)) · min{ (1-γ)^6 ε^2, 1/t_mix } for some constant c_1. Then asynchronous double Q-learning learns an ε-accurate optimum, i.e., ||Q^A_T - Q*|| ≤ ε, with probability at least 1-δ, given the time complexity

T = Ω( ( 1/(μ_min ε^2 (1-γ)^7) + t_mix/(μ_min (1-γ)) ) · ln( 1/((1-γ)^2 ε) ) ),

where t_mix is defined in (14).

The complexity in Theorem 2 is given in terms of the mixing time. To facilitate comparisons, we provide the following result in terms of the covering number.

Theorem 3. Under the same conditions as Theorem 2, consider a constant learning rate α_t = α = (c_2/ln(DT/δ)) · min{ (1-γ)^6 ε^2, 1 } for some constant c_2. Then asynchronous double Q-learning can learn an ε-accurate optimum, i.e., ||Q^A_T - Q*|| ≤ ε, with probability at least 1-δ, given the time complexity

T = Ω( ( L/((1-γ)^7 ε^2) ) · ln( 1/((1-γ)^2 ε) ) ),

where L is defined in (13).

We next compare Theorem 3 with the result obtained in Xiong et al. (2020), where the authors provided the time complexity of asynchronous double Q-learning as

T = Ω( ( L^4/((1-γ)^6 ε^2) · ln( DL^4/((1-γ)^7 ε^2) ) )^{1/ω} + ( L^2/(1-γ) · ln( 1/((1-γ)^2 ε) ) )^{1/(1-ω)} ),   (15)

where ω ∈ (1/3, 1). It can be observed that our result improves on (15) with respect to the order of all key parameters ε, D, 1-γ, and L (see Table 1). Specifically, the dependence on L in (15) can be optimized by choosing ω = 2/3, upon which Theorem 3 improves by a factor of at least L^5.
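To make the mixing time (14) concrete, the following sketch computes t_mix for a small hypothetical Markov chain over state-action pairs: the smallest t at which the worst-case total variation distance to the stationary distribution drops to 1/4. The chain is randomly generated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6                                        # number of state-action pairs (illustrative)
P = rng.dirichlet(np.ones(n) * 2.0, size=n)  # row-stochastic transition matrix

# Stationary distribution mu_pi: the (normalized) left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

Pt = np.eye(n)
t_mix = None
for t in range(1, 200):
    Pt = Pt @ P                              # t-step transition kernel P^t
    tv = 0.5 * float(np.max(np.abs(Pt - mu).sum(axis=1)))  # worst case over initial pairs
    if tv <= 0.25:
        t_mix = t
        break
print(t_mix)
```

A dense chain like this one mixes in just a few steps; chains with poorly connected regions (small μ_min) mix far more slowly, which is exactly what the t_mix/μ_min terms in Theorem 2 penalize.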

3.5. PROOF SKETCH OF THEOREM 1

In order to derive the convergence bound for double Q-learning under the rescaled linear learning rate, we develop a different analysis approach from that in Xiong et al. (2020), which does not handle the rescaled linear learning rate. More specifically, in order to analyze a pair of nested SA recursions, we directly bound both the error dynamics of the outer SA between the Q-estimator and the global optimum and the error propagation between the two Q-estimators captured by the inner SA. We then integrate the bound on the inner SA into that on the outer SA as a noise term and establish the final convergence bound. This is a very different yet more direct approach than the techniques in Xiong et al. (2020), which construct two complicated block-wisely decreasing bounds for the two SAs to characterize a block-wise convergence. Our finite-time analysis for synchronous double Q-learning (i.e., Theorem 1) includes four steps.

Step I: Bounding the outer SA dynamics E||r_t|| by the inner SA dynamics E||ν_t||. Here, r_t := Q^A_t - Q* captures the error dynamics between the Q-estimator and the global optimum, and ν_t := Q^B_t - Q^A_t captures the error propagation between the two Q-estimators. We apply Proposition 1 to the error dynamics (5) of r_t, take the expectation, and apply the learning rate inequality (24) to obtain

E||r_t|| ≤ α_{t-1} ||r_1|| + (γ/2) α_{t-1} Σ_{k=1}^{t-1} ( E||W_k|| + E||ν_k|| ) + E||W_t||,   (16)

where W_{t+1} = (1 - α̂_t) W_t + α̂_t ε_t with initialization W_1 = 0.

Step II: Bounding E||W_t||. We first construct an F_t-martingale sequence {W̃_i}_{1≤i≤t+1} with W̃_{t+1} = W_{t+1} and W̃_1 = 0. Next, we bound the squared difference sequence (W̃_{i+1} - W̃_i)^2 by 4V_max^2 α_t^N/α_i^{N-2} for 1 ≤ i ≤ t, where N is defined in (26). Then we apply the Azuma-Hoeffding inequality (see Lemma 5) to {W̃_i}_{1≤i≤t+1} and further use Lemma 6 to obtain the bound on E||W_t|| in Proposition 2, which is given by

E||W_{t+1}|| ≤ κ C̃ √α_t,   (17)

where C̃ = 6√(ln 2D) + 3√π and κ is defined in (6).

Step III: Bounding the inner SA dynamics E||ν_t||. Similarly to Step I, we apply Proposition 1 to the ν_t-recursion (7), take the expectation, and apply the learning rate inequality (24) to obtain

E||ν_t|| ≤ α_{t-1} ||ν_1|| + ((1+γ)/2) α_{t-1} Σ_{k=2}^{t-1} E||M_k|| + E||M_t||,   (18)

where M_{t+1} = (1 - α_t) M_t + α_t μ_t with initialization M_1 = 0. Using a similar idea to Step II, we obtain the bound on E||M_t|| in Proposition 3. Finally, we substitute the bound on E||M_t|| back into (18) and use the fact that ν_1 = 0 to obtain

E||ν_t|| ≤ ( 6 V_max D̃/(1-γ) ) √α_{t-1}, with D̃ = 2√(ln 2D) + √π.   (19)

Step IV: Deriving the finite-time bound. Substituting (17) and (19) into (16) yields (11).
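The key scaling in Step II can be checked numerically: with the recursion W_{t+1} = (1 - a_t)W_t + a_t ε_t driven by i.i.d. bounded zero-mean noise, the expected sup norm of W_t should shrink like √(a_t), matching the κC̃√(α_t) form of Proposition 2 up to constants. The dimension, horizon, number of runs, and seed below are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
D, gamma, T, runs = 8, 0.9, 20000, 100

def alpha(t):
    """Rescaled linear learning rate used in Theorem 1."""
    return 3.0 / (3.0 + (1.0 - gamma) * t)

# Run the W-recursion for many independent noise sequences in parallel.
W = np.zeros((runs, D))
for t in range(1, T + 1):
    a = alpha(t)
    W = (1.0 - a) * W + a * rng.uniform(-1.0, 1.0, size=(runs, D))

# Average sup norm divided by sqrt(alpha_T): should be an O(1) constant,
# well below the kappa * C~ prefactor appearing in the actual bound.
ratio = float(np.abs(W).max(axis=1).mean() / np.sqrt(alpha(T)))
print(ratio)
```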

4. CONCLUSION

In this paper, we derived sharper finite-time bounds for double Q-learning with both synchronous sampling and Markovian asynchronous sampling. To achieve this, we developed a different approach to bounding two nested stochastic approximation recursions. An important yet challenging future topic is the convergence guarantee for double Q-learning with function approximation. In addition to the loss of the contraction property of the Bellman operator in the function approximation setting, it is likely that neither of the two Q-estimators converges, or that they do not converge to the same point even if both converge. Characterizing the conditions under which double Q-learning with function approximation converges remains an open problem.




