GREEDY-GQ WITH VARIANCE REDUCTION: FINITE-TIME ANALYSIS AND IMPROVED COMPLEXITY

Abstract

Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal control. Recently, a finite-time analysis of Greedy-GQ was developed under linear function approximation and Markovian sampling, showing that the algorithm achieves an ε-stationary point with a sample complexity of order O(ε^{-3}). Such a high sample complexity is due to the large variance induced by the Markovian samples. In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy optimal control. In particular, the algorithm applies the SVRG-based variance reduction scheme to reduce the stochastic variance of the two time-scale updates. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling and show that the algorithm achieves much smaller bias and variance errors than the original Greedy-GQ. In particular, we prove that VR-Greedy-GQ achieves an improved sample complexity of order O(ε^{-2}). We further compare the performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to corroborate our theoretical findings.

1. INTRODUCTION

In reinforcement learning (RL), an agent interacts with a stochastic environment following a certain policy and receives rewards, and it aims to learn an optimal policy that yields the maximum accumulated reward Sutton & Barto (2018). In particular, many RL algorithms have been developed to learn the optimal control policy, and they have been widely applied to practical domains such as finance, robotics, computer games, and recommendation systems Mnih et al. (2015; 2016); Silver et al. (2016); Kober et al. (2013). Conventional RL algorithms such as Q-learning Watkins & Dayan (1992) and SARSA Rummery & Niranjan (1994) have been well studied, and their convergence is guaranteed in the tabular setting. However, these algorithms may diverge in the popular off-policy setting under linear function approximation Baird (1995); Gordon (1996). To address this issue, the two time-scale Greedy-GQ algorithm was developed in Maei et al. (2010) for learning the optimal policy. This algorithm extends the efficient gradient temporal difference (GTD) algorithms for policy evaluation Sutton et al. (2009b) to policy optimization. In particular, the asymptotic convergence of Greedy-GQ to a stationary point was established in Maei et al. (2010). More recently, Wang & Zou (2020) studied the finite-time convergence of Greedy-GQ under linear function approximation and Markovian sampling, showing that the algorithm achieves an ε-stationary point of the objective function with a sample complexity of order O(ε^{-3}). Such an undesirably high sample complexity is caused by the large variance induced by the Markovian samples queried from the dynamic environment. Therefore, we ask the following question.

• Q1: Can we develop a variance reduction scheme for the two time-scale Greedy-GQ algorithm?
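To make the SVRG-based variance reduction idea concrete, the following is a minimal sketch of the generic SVRG template on a simple least-squares objective, not the paper's two time-scale VR-Greedy-GQ update; the objective, step size, and epoch length are illustrative assumptions. The key mechanism is the same: a full gradient computed at a periodic reference (snapshot) point is used to correct each stochastic gradient, so that the estimator stays unbiased while its variance vanishes as the iterate approaches the reference.

```python
import numpy as np

# Hypothetical component losses f_i(theta) = 0.5 * (a_i @ theta - b_i)^2,
# used only to illustrate the SVRG variance-reduction scheme.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(theta, i):
    # Stochastic gradient from a single sample i.
    return A[i] * (A[i] @ theta - b[i])

def full_grad(theta):
    # Full (batch) gradient over all n samples.
    return A.T @ (A @ theta - b) / n

theta = np.zeros(d)
eta, epochs, m = 0.02, 40, n  # step size and epoch length (assumed values)
for _ in range(epochs):
    ref = theta.copy()       # snapshot point for this epoch
    g_ref = full_grad(ref)   # full gradient at the snapshot
    for _ in range(m):
        i = rng.integers(n)
        # SVRG estimator: unbiased, with variance shrinking as theta -> ref.
        v = grad_i(theta, i) - grad_i(ref, i) + g_ref
        theta -= eta * v

print(np.linalg.norm(full_grad(theta)))  # gradient norm at the final iterate
```

The full-gradient correction is what removes the O(1) stochastic noise floor of plain SGD, which is the mechanism VR-Greedy-GQ exploits, applied to both of its time-scale updates, to improve the O(ε^{-3}) complexity of Greedy-GQ.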

