GREEDY-GQ WITH VARIANCE REDUCTION: FINITE-TIME ANALYSIS AND IMPROVED COMPLEXITY

Abstract

Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal control. Recently, a finite-time analysis of Greedy-GQ was developed under linear function approximation and Markovian sampling, showing that the algorithm reaches an ε-stationary point with a sample complexity of order O(ε^{-3}). Such a high sample complexity is due to the large variance induced by the Markovian samples. In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy optimal control. In particular, the algorithm applies the SVRG-based variance reduction scheme to reduce the stochastic variance of the two time-scale updates. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling and show that the algorithm achieves much smaller bias and variance errors than the original Greedy-GQ. In particular, we prove that VR-Greedy-GQ achieves an improved sample complexity of order O(ε^{-2}). We further compare the performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to corroborate our theoretical findings.

1. INTRODUCTION

In reinforcement learning (RL), an agent interacts with a stochastic environment following a certain policy, receives rewards, and aims to learn an optimal policy that yields the maximum accumulated reward Sutton & Barto (2018). In particular, many RL algorithms have been developed to learn the optimal control policy, and they have been widely applied to practical applications such as finance, robotics, computer games and recommendation systems Mnih et al. (2015; 2016); Silver et al. (2016); Kober et al. (2013). Conventional RL algorithms such as Q-learning Watkins & Dayan (1992) and SARSA Rummery & Niranjan (1994) have been well studied, and their convergence is guaranteed in the tabular setting. However, it is known that these algorithms may diverge in the popular off-policy setting under linear function approximation Baird (1995); Gordon (1996). To address this issue, the two time-scale Greedy-GQ algorithm was developed in Maei et al. (2010) for learning the optimal policy. This algorithm extends the efficient gradient temporal difference (GTD) algorithms for policy evaluation Sutton et al. (2009b) to policy optimization. In particular, the asymptotic convergence of Greedy-GQ to a stationary point was established in Maei et al. (2010). More recently, Wang & Zou (2020) studied the finite-time convergence of Greedy-GQ under linear function approximation and Markovian sampling, showing that the algorithm achieves an ε-stationary point of the objective function with a sample complexity of order O(ε^{-3}). Such an undesirably high sample complexity is caused by the large variance induced by the Markovian samples queried from the dynamic environment. Therefore, we ask the following question.

• Q1: Can we develop a variance reduction scheme for the two time-scale Greedy-GQ algorithm?
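To make the two time-scale structure concrete, the following is a minimal sketch of the Greedy-GQ update under linear function approximation. The feature shapes, step sizes, and the simple argmax-based greedy target are illustrative assumptions; the exact target-policy construction analyzed in the paper may differ (e.g., a smoothed greedy policy).

```python
import numpy as np

def greedy_gq_step(theta, w, phi, next_phis, reward,
                   gamma=0.99, eta_theta=0.01, eta_w=0.1):
    """One two time-scale Greedy-GQ update with linear features.

    phi:       feature vector of the current state-action pair, shape (d,)
    next_phis: features of the candidate next-state actions, shape (A, d)
    """
    # Greedy target: next action maximizing the current value estimate.
    phi_next = next_phis[np.argmax(next_phis @ theta)]
    # TD error under the greedy target policy.
    delta = reward + gamma * phi_next @ theta - phi @ theta
    # Slow time-scale: (approximate) gradient step on the objective.
    theta_new = theta + eta_theta * (delta * phi - gamma * (w @ phi) * phi_next)
    # Fast time-scale: w tracks the solution of a least-squares subproblem.
    w_new = w + eta_w * (delta - phi @ w) * phi
    return theta_new, w_new
```

The correction term −γ(wᵀφ)φ' in the slow update is what distinguishes Greedy-GQ from naive off-policy Q-learning and underlies its stability under function approximation.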
In fact, many recent works in the existing literature have proposed to apply the variance reduction techniques developed in the stochastic optimization literature to reduce the variance of various TD learning algorithms for policy evaluation, e.g., Du et al. (2017; 2020). Hence, it is highly desirable to develop a variance-reduced Greedy-GQ algorithm for optimal control. In particular, as many of the existing variance-reduced RL algorithms have been shown to achieve an improved sample complexity under variance reduction, it is natural to ask the following fundamental question.

• Q2: Can variance-reduced Greedy-GQ achieve an improved sample complexity under Markovian sampling?

In this paper, we provide affirmative answers to these fundamental questions. Specifically, we develop a two time-scale variance reduction scheme for the Greedy-GQ algorithm by leveraging the SVRG scheme Johnson & Zhang (2013). Moreover, under linear function approximation and Markovian sampling, we prove that the proposed variance-reduced Greedy-GQ algorithm achieves an ε-stationary point with an improved sample complexity O(ε^{-2}). We summarize our technical contributions as follows.

1.1 OUR CONTRIBUTIONS

We develop a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for optimal control in reinforcement learning. Specifically, the algorithm leverages the SVRG variance reduction scheme Johnson & Zhang (2013) to construct variance-reduced stochastic updates for the parameters in both time-scales. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling in the off-policy setting. Specifically, we show that VR-Greedy-GQ achieves an ε-stationary point of the objective function J (i.e., ‖∇J(θ)‖² ≤ ε) with a sample complexity of order O(ε^{-2}). This complexity result improves that of the original Greedy-GQ by a significant factor of O(ε^{-1}) Wang & Zou (2020).
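The SVRG scheme of Johnson & Zhang (2013) replaces each stochastic update direction G_i(θ) with the corrected direction G_i(θ) − G_i(θ̃) + (1/M) Σ_j G_j(θ̃), where θ̃ is a snapshot refreshed at the start of each epoch. The epoch structure can be illustrated with a minimal single time-scale sketch on a synthetic least-squares problem (the data A, b and all hyper-parameters below are hypothetical, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))          # hypothetical per-sample features
b = rng.normal(size=n)               # hypothetical targets

def grad_i(theta, i):
    # Gradient of the i-th sample loss 0.5 * (a_i^T theta - b_i)^2.
    return (A[i] @ theta - b[i]) * A[i]

def svrg(theta, epochs=20, inner=100, lr=0.01):
    for _ in range(epochs):
        ref = theta.copy()                      # snapshot theta-tilde
        full = A.T @ (A @ ref - b) / n          # batch gradient at the snapshot
        for _ in range(inner):
            i = rng.integers(n)
            # Variance-reduced direction: G_i(theta) - G_i(ref) + full.
            theta = theta - lr * (grad_i(theta, i) - grad_i(ref, i) + full)
    return theta
```

Each epoch costs one batch-gradient pass of size M plus the cheap inner updates. VR-Greedy-GQ applies the same correction simultaneously to both the θ- and w-updates, with the reference batch drawn from the Markovian trajectory.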
In particular, our analysis shows that the bias error caused by the Markovian sampling and the variance error of the stochastic updates are of order O(M^{-1}) and O(η_θ M^{-1}), respectively, where η_θ is the learning rate and M corresponds to the batch size of the SVRG reference batch update. This shows that the proposed variance reduction scheme can significantly reduce the bias and variance errors of the original Greedy-GQ update (by a factor of M) and lead to an improved overall sample complexity.

The analysis of VR-Greedy-GQ partly follows that of the conventional SVRG, but requires substantial new technical developments. Specifically, we must address the following challenges. First, VR-Greedy-GQ involves two time-scale variance-reduced updates that are correlated with each other. Such an extension of the SVRG scheme to two time-scale updates is novel and requires new technical developments; in particular, we need to develop tight variance bounds for the two time-scale updates under Markovian sampling. Second, unlike the convex objective functions of the conventional GTD-type algorithms, the objective function of VR-Greedy-GQ is generally non-convex due to the non-stationary target policy. Hence, we need to develop new techniques to characterize the per-iteration optimization progress towards a stationary point under non-convexity. In particular, to analyze the two time-scale variance reduction updates of the algorithm, we introduce a 'fine-tuned' Lyapunov function of the form R_t^{(m)} = J(θ_t^{(m)}) + c_t ‖θ_t^{(m)} − θ̃^{(m)}‖², where the parameter c_t is fine-tuned to cancel the additional quadratic terms ‖θ_t^{(m)} − θ̃^{(m)}‖² that are implicitly involved in the tracking error terms. The design of this special Lyapunov function is critical to establishing the formal convergence of the algorithm. With these technical developments, we establish an improved finite-time convergence rate and sample complexity for VR-Greedy-GQ.
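The O(η_θ M^{-1}) variance claim reflects a basic property of the SVRG correction: the variance of the corrected direction scales with ‖θ − θ̃‖², which the step size and epoch length keep small. This can be checked numerically on synthetic least-squares gradients (a stand-in for the Greedy-GQ update; the data and distances below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def all_grads(theta):
    # Per-sample gradients of 0.5 * (a_i^T theta - b_i)^2, shape (n, d).
    return (A @ theta - b)[:, None] * A

theta_ref = rng.normal(size=d)                  # snapshot theta-tilde
full_ref = all_grads(theta_ref).mean(axis=0)    # batch gradient at the snapshot

def variance(g):
    # Average squared deviation of per-sample directions from their mean.
    return np.mean(np.sum((g - g.mean(axis=0)) ** 2, axis=1))

results = {}
for dist in (1.0, 0.1, 0.01):
    # An iterate at roughly `dist` away from the snapshot.
    theta = theta_ref + dist * rng.normal(size=d) / np.sqrt(d)
    plain = variance(all_grads(theta))
    corrected = variance(all_grads(theta) - all_grads(theta_ref) + full_ref)
    results[dist] = (plain, corrected)
```

The plain stochastic direction keeps an O(1) variance regardless of the iterate, while the corrected direction's variance shrinks roughly like ‖θ − θ̃‖², which is the mechanism behind the improved error bounds above.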



1.2 RELATED WORK

Variance-reduced RL algorithms. Variance reduction techniques have been applied to TD learning algorithms for policy evaluation, e.g., Peng et al. (2019); Korda & La (2015); Xu et al. (2020). Some other work applied variance reduction techniques to Q-learning algorithms, e.g., Wainwright (2019); Jia et al.

Q-learning and SARSA with function approximation. The asymptotic convergence of Q-learning and SARSA under linear function approximation was established in Melo et al. (2008); Perkins & Precup (2003), and their finite-time analyses were developed in Zou et al. (2019); Chen et al. (2019). However, these algorithms may diverge in off-policy training Baird (1995). Also, recent works have focused on the Markovian setting, and various analysis techniques have been developed to analyze

