ON O(1/K) CONVERGENCE AND LOW SAMPLE COMPLEXITY FOR SINGLE-TIMESCALE POLICY EVALUATION WITH NONLINEAR FUNCTION APPROXIMATION

Anonymous

Abstract

Learning an accurate value function for a given policy is a critical step in solving reinforcement learning (RL) problems. So far, however, the convergence speed and sample complexity of most existing policy evaluation algorithms remain unsatisfactory, particularly with nonlinear function approximation. This challenge motivates us to develop a new variance-reduced primal-dual method (VRPD) that achieves a fast convergence speed for RL policy evaluation with nonlinear function approximation. To overcome the high sample complexity of variance-reduced approaches (caused by periodic full-gradient evaluations over all training data), we further propose an enhanced VRPD method with an adaptive-batch adjustment (VRPD+). The main features of VRPD include: i) VRPD allows the use of constant step sizes and achieves an O(1/K) convergence rate to the first-order stationary points of non-convex policy evaluation problems; ii) VRPD is a generic single-timescale algorithm that is also applicable to a large class of non-convex strongly-concave minimax optimization problems; iii) by adaptively adjusting the batch size via historical stochastic gradient information, VRPD+ is empirically more sample-efficient without loss of theoretical convergence rate. Our extensive numerical experiments verify our theoretical findings and showcase the high efficiency of the proposed VRPD and VRPD+ algorithms compared with state-of-the-art methods.

1. INTRODUCTION

In recent years, advances in reinforcement learning (RL) have achieved enormous successes in a large number of areas, including healthcare (Petersen et al., 2019; Raghu et al., 2017b), financial recommendation (Theocharous et al., 2015), resource management (Mao et al., 2016; Tesauro et al., 2006), and robotics (Kober et al., 2013; Levine et al., 2016; Raghu et al., 2017a), to name just a few. In RL applications, an agent interacts with an environment and repeats the tasks of observing the current state, performing a policy-based action, receiving a reward, and transitioning to the next state. A key step in many RL algorithms is the policy evaluation (PE) problem, which aims to learn the value function that estimates the expected long-term accumulative reward for a given policy. Value functions not only explicitly provide the agent's accumulative rewards, but can also be utilized to update the current policy so that the agent visits valuable states more frequently (Bertsekas & Tsitsiklis, 1995; Lagoudakis & Parr, 2003). In RL policy evaluation, two of the most important performance metrics are convergence rate and sample complexity. First, since policy evaluation is a subroutine of an overall RL task, developing fast-converging policy evaluation algorithms is of critical importance to the overall efficiency of RL. Second, due to the challenges in collecting a large number of training samples (trajectories of state-action pairs) for policy evaluation in RL, reducing the number of samples (i.e., the sample complexity) can significantly alleviate the burden of data collection for solving policy evaluation problems. These two important aspects motivate us to pursue a fast-converging policy evaluation algorithm with low sample complexity in this paper. Among various algorithms for policy evaluation, one of the simplest and most effective methods is the temporal difference (TD) learning approach (Sutton, 1988).
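To make the TD learning principle concrete, the following is a minimal sketch of tabular TD(0) policy evaluation on a toy Markov chain. All names and parameters (the 3-state chain, the step size `alpha`, and the discount `gamma`) are illustrative assumptions for this sketch; this is the classical TD(0) update of Sutton (1988), not the VRPD method proposed in this paper.

```python
# Hedged sketch: tabular TD(0) policy evaluation on a toy 3-state chain.
# The environment, rewards, and hyperparameters below are illustrative
# assumptions, not part of the paper's algorithm.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# Evaluate a fixed policy on a deterministic chain 0 -> 1 -> 2 (state 2 is terminal).
V = [0.0, 0.0, 0.0]
rewards = {(0, 1): 0.0, (1, 2): 1.0}
for _ in range(200):
    for s, s_next in [(0, 1), (1, 2)]:
        td0_update(V, s, rewards[(s, s_next)], s_next)

# V[1] approaches the immediate reward 1.0, and V[0] approaches gamma * V[1] = 0.9.
```

The update shrinks the gap between temporally successive predictions, V[s] and r + gamma * V[s_next], which is exactly the "difference between successive predictions" that TD learning minimizes.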
Rather than directly comparing predictions with eventual outcomes, the key idea of TD learning is to make the difference between temporally successive predictions small. Specifically, the TD learning approach learns the value function by

