ON O(1/K) CONVERGENCE AND LOW SAMPLE COMPLEXITY FOR SINGLE-TIMESCALE POLICY EVALUATION WITH NONLINEAR FUNCTION APPROXIMATION

Anonymous

Abstract

Learning an accurate value function for a given policy is a critical step in solving reinforcement learning (RL) problems. So far, however, the convergence speed and sample complexity of most existing policy evaluation algorithms remain unsatisfactory, particularly with nonlinear function approximation. This challenge motivates us to develop a new variance-reduced primal-dual method (VRPD) that achieves a fast convergence speed for RL policy evaluation with nonlinear function approximation. To overcome the high sample complexity of variance-reduced approaches (due to the periodic full gradient evaluation over all training data), we further propose an enhanced VRPD method with adaptive-batch adjustment (VRPD+). The main features of VRPD include: i) VRPD allows the use of constant step sizes and achieves an O(1/K) convergence rate to the first-order stationary points of non-convex policy evaluation problems; ii) VRPD is a generic single-timescale algorithm that is also applicable to a large class of non-convex strongly-concave minimax optimization problems; iii) by adaptively adjusting the batch size via historical stochastic gradient information, VRPD+ is empirically more sample-efficient without loss of theoretical convergence rate. Our extensive numerical experiments verify our theoretical findings and showcase the high efficiency of the proposed VRPD and VRPD+ algorithms compared with state-of-the-art methods.

1. INTRODUCTION

In recent years, advances in reinforcement learning (RL) have achieved enormous successes in a large number of areas, including healthcare (Petersen et al., 2019; Raghu et al., 2017b), financial recommendation (Theocharous et al., 2015), resource management (Mao et al., 2016; Tesauro et al., 2006), and robotics (Kober et al., 2013; Levine et al., 2016; Raghu et al., 2017a), to name just a few. In RL applications, an agent interacts with an environment and repeats the tasks of observing the current state, performing a policy-based action, receiving a reward, and transitioning to the next state. A key step in many RL algorithms is the policy evaluation (PE) problem, which aims to learn the value function that estimates the expected long-term accumulative reward for a given policy. Value functions not only explicitly provide the agent's accumulative rewards, but can also be utilized to update the current policy so that the agent visits valuable states more frequently (Bertsekas & Tsitsiklis, 1995; Lagoudakis & Parr, 2003). In RL policy evaluation, two of the most important performance metrics are convergence rate and sample complexity. First, since policy evaluation is a subroutine of an overall RL task, developing fast-converging policy evaluation algorithms is of critical importance to the overall efficiency of RL. Second, due to the challenges in collecting a large number of training samples (trajectories of state-action pairs) for policy evaluation in RL, reducing the number of samples (i.e., the sample complexity) can significantly alleviate the burden of data collection. These two important aspects motivate us to pursue a fast-converging policy evaluation algorithm with low sample complexity in this paper. Among the various algorithms for policy evaluation, one of the simplest and most effective is the temporal difference (TD) learning approach (Sutton, 1988).
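For concreteness, the value function referenced above and the Bellman equation used for bootstrapping can be stated in standard form (these are the generic textbook definitions with discount factor $\gamma \in (0,1)$ and transition kernel $P$, not the paper's specific formulation):

```latex
V^{\pi}(s) = \mathbb{E}\left[\left.\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\right|\, s_0 = s,\; a_t \sim \pi(\cdot \mid s_t)\right],
\qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim P(\cdot \mid s,a)}\big[r(s,a) + \gamma\, V^{\pi}(s')\big].
```

TD methods exploit the second identity: the current estimate of $V^{\pi}(s')$ serves as a bootstrapped target for $V^{\pi}(s)$.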
Instead of focusing on the difference between predicted and actual outcomes, the key idea of TD learning is to make the difference between temporally successive predictions small. Specifically, the TD learning approach learns the value function by using the Bellman equation to bootstrap from the current estimated value function. To date, many algorithms have been proposed within the family of TD learning (Dann et al., 2014). However, most of these methods suffer from either unstable convergence (e.g., TD(λ) (Sutton, 1988) for off-policy training) or a high computational complexity in training with massive features (e.g., least-squares temporal difference (LSTD) (Boyan, 2002; Bradtke & Barto, 1996)). The limitation of these early attempts is largely due to the fact that they do not leverage the gradient oracle in policy evaluation. Thus, in recent years, gradient-based policy evaluation algorithms have become increasingly prevalent. However, the design of an efficient gradient-based policy evaluation algorithm is a non-trivial task. On one hand, as an RL task becomes more sophisticated, it is more appropriate to utilize nonlinear function approximation (e.g., a deep neural network (DNN)) to model the value function. However, when working with nonlinear DNN models, the convergence of conventional single-timescale TD algorithms may not be guaranteed (Tsitsiklis & Van Roy, 1997). To address this issue, some convergent two-timescale algorithms (Bhatnagar et al., 2009; Chung et al., 2018) have been proposed, at the expense of higher implementation complexity. On the other hand, modern policy evaluation tasks could involve a large amount of state transition data. To perform policy evaluation, algorithms typically need to calculate full gradients that require all training data (e.g., gradient temporal difference (GTD) (Sutton et al., 2008) and TD with gradient correction (TDC) (Sutton et al., 2009b)), which entails a high sample complexity. So far, existing works on PE either focus on linear function approximation (GTD2 (Sutton et al., 2009b), PDBG (Du et al., 2017), SVRG (Du et al., 2017), SAGA (Du et al., 2017)) or suffer from slower convergence (STSG (Qiu et al., 2020), VR-STSG (Qiu et al., 2020), nPD-VR (Wai et al., 2019)) (see detailed discussions in Section 2). In light of the above limitations, in this paper we ask the following question: Could we develop an efficient single-timescale gradient-based algorithm for policy evaluation based on nonlinear function approximation? In this paper, we give an affirmative answer to this question.
Specifically, we propose an efficient gradient-based variance-reduced primal-dual algorithm (VRPD) to tackle the policy evaluation problem with nonlinear function approximation, which we recast as a minimax optimization problem. Our VRPD algorithm admits a simple and elegant single-timescale algorithmic structure. We further enhance VRPD by proposing VRPD+, which uses adaptive batch sizes to relax the periodic full gradient evaluation and thereby further reduce sample complexity. The main contribution of this paper is that our proposed algorithms achieve an O(1/K) convergence rate (where K is the number of iterations) with constant step sizes for policy evaluation with nonlinear function approximation, which is the best-known result in the literature thus far. Our main results are highlighted as follows:

• By utilizing a variance reduction technique, our VRPD algorithm allows constant step sizes and enjoys a low sample complexity. We show that, under mild assumptions and appropriate parameter choices, VRPD achieves an O(1/K) convergence rate to the first-order stationary point of a class of nonconvex-strongly-concave (NCSC) minimax problems, which is the best-known result in the literature. To achieve this result, our convergence rate analysis introduces new proof techniques, resolves an open question, and clarifies an ambiguity in the state-of-the-art convergence analysis of VR-based policy evaluation methods (see the second paragraph in Section 2.1 for more discussions).

• VRPD+ significantly improves the sample complexity of the VRPD algorithm for policy evaluation with massive datasets. Our VRPD+ (adaptive-batch VRPD) algorithm incorporates historical information along the optimization path, but does not involve backtracking or condition verification. We show that VRPD+ significantly reduces the number of samples and the computational load of gradient evaluations, thanks to our proposed adaptive batch size technique, which avoids full gradient evaluation.

• Our extensive experimental results confirm that our algorithms outperform the state-of-the-art gradient-based policy evaluation algorithms, and that VRPD+ can further reduce the sample complexity compared to VRPD. It is worth noting that, although the focus of our work is on RL policy evaluation, our algorithmic design and proof techniques contribute to the area of minimax optimization and could be of independent theoretical interest.
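To make the single-timescale variance-reduced primal-dual pattern concrete, the sketch below runs a SPIDER-style recursive gradient estimator with periodic full-gradient resets and constant step sizes on a toy strongly-concave-in-w saddle problem (a least-squares reformulation). This is an illustration of the generic algorithmic structure only, not the paper's exact VRPD updates or its nonlinear policy evaluation objective; all names and parameter values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)                       # targets admitting an exact solution

def batch_grads(theta, w, idx):
    """Minibatch gradients of f_i(theta, w) = w^T a_i (a_i^T theta - b_i) - 0.5*||w||^2."""
    Ai, bi = A[idx], b[idx]
    g_theta = Ai.T @ (Ai @ w) / len(idx)              # gradient w.r.t. primal theta
    g_w = Ai.T @ (Ai @ theta - bi) / len(idx) - w     # gradient w.r.t. dual w (strongly concave)
    return g_theta, g_w

theta, w = np.zeros(d), np.zeros(d)
eta, lam = 0.05, 0.05                            # constant, single-timescale step sizes
q, batch = 20, 10                                # full-gradient period and minibatch size
for k in range(3000):
    if k % q == 0:                               # periodic full gradient resets the estimator
        v_theta, v_w = batch_grads(theta, w, np.arange(n))
    else:                                        # SPIDER-style recursive correction:
        idx = rng.integers(n, size=batch)        # v_k = v_{k-1} + g(x_k) - g(x_{k-1}), same batch
        g_t, g_w_new = batch_grads(theta, w, idx)
        p_t, p_w = batch_grads(theta_prev, w_prev, idx)
        v_theta, v_w = v_theta + g_t - p_t, v_w + g_w_new - p_w
    theta_prev, w_prev = theta.copy(), w.copy()
    theta, w = theta - eta * v_theta, w + lam * v_w   # simultaneous descent/ascent step

mse = float(np.mean((A @ theta - b) ** 2))       # residual at the recovered saddle point
```

Note that the primal and dual variables are updated simultaneously with constant step sizes of the same order, which is what makes the loop single-timescale; the periodic full gradient plus recursive correction is the variance-reduction mechanism that the adaptive-batch variant would relax by shrinking the reset batch using historical gradient information.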

2. RELATED WORK

1) TD Learning with Function Approximation for Policy Evaluation: TD learning with function approximation plays a vital role in policy evaluation for RL. The key idea of TD learning is to make the difference between temporally successive predictions small.
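As a minimal illustration of this bootstrapping idea (the chain MDP, function names, and parameter values below are hypothetical, not the paper's experimental setting), tabular TD(0) nudges each estimate V(s) toward the Bellman target r + γV(s'):

```python
def td0(transitions, gamma=0.9, alpha=0.1, num_states=3):
    """Tabular TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    V = [0.0] * num_states
    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]  # temporal-difference error
        V[s] += alpha * td_error                 # step toward the bootstrapped target
    return V

# Hypothetical 3-state chain 0 -> 1 -> 2 with reward 1 on reaching state 2,
# replayed 200 times under a fixed policy.
V = td0([(0, 0.0, 1), (1, 1.0, 2)] * 200)
```

Because successive predictions V(s) and r + γV(s') are driven together, states nearer the reward receive higher values: here V(1) approaches 1 and V(0) approaches γ·V(1).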




