ON O(1/K) CONVERGENCE AND LOW SAMPLE COMPLEXITY FOR SINGLE-TIMESCALE POLICY EVALUATION WITH NONLINEAR FUNCTION APPROXIMATION

Anonymous

Abstract

Learning an accurate value function for a given policy is a critical step in solving reinforcement learning (RL) problems. So far, however, the convergence speed and sample complexity of most existing policy evaluation algorithms remain unsatisfactory, particularly with nonlinear function approximation. This challenge motivates us to develop a new variance-reduced primal-dual method (VRPD) that achieves a fast convergence speed for RL policy evaluation with nonlinear function approximation. To mitigate the high sample complexity of variance-reduced approaches (due to the periodic full gradient evaluation with all training data), we further propose an enhanced VRPD method with adaptive batch-size adjustment (VRPD+). The main features of our methods include: i) VRPD allows the use of constant step sizes and achieves an $O(1/K)$ convergence rate to first-order stationary points of non-convex policy evaluation problems; ii) VRPD is a generic single-timescale algorithm that is also applicable to a large class of non-convex strongly-concave minimax optimization problems; iii) by adaptively adjusting the batch size via historical stochastic gradient information, VRPD+ is more sample-efficient empirically without loss of the theoretical convergence rate. Our extensive numerical experiments verify our theoretical findings and showcase the high efficiency of the proposed VRPD and VRPD+ algorithms compared with state-of-the-art methods.

1. INTRODUCTION

In recent years, advances in reinforcement learning (RL) have achieved enormous successes in a large number of areas, including healthcare (Petersen et al., 2019; Raghu et al., 2017b), financial recommendation (Theocharous et al., 2015), resource management (Mao et al., 2016; Tesauro et al., 2006), and robotics (Kober et al., 2013; Levine et al., 2016; Raghu et al., 2017a), to name just a few. In RL applications, an agent interacts with an environment and repeats the tasks of observing the current state, performing a policy-based action, receiving a reward, and transitioning to the next state. A key step in many RL algorithms is the policy evaluation (PE) problem, which aims to learn the value function that estimates the expected long-term accumulative reward for a given policy. Value functions not only explicitly provide the agent's accumulative rewards, but can also be utilized to update the current policy so that the agent visits valuable states more frequently (Bertsekas & Tsitsiklis, 1995; Lagoudakis & Parr, 2003). In RL policy evaluation, two of the most important performance metrics are convergence rate and sample complexity. First, since policy evaluation is a subroutine of an overall RL task, developing fast-converging policy evaluation algorithms is of critical importance to the overall efficiency of RL. Second, due to the challenges in collecting a large number of training samples (trajectories of state-action pairs) for policy evaluation in RL, reducing the number of samples (i.e., the sample complexity) can significantly alleviate the burden of data collection. These two aspects motivate us to pursue a fast-converging policy evaluation algorithm with a low sample complexity in this paper. Among various algorithms for policy evaluation, one of the simplest and most effective methods is the temporal difference (TD) learning approach (Sutton, 1988).
Instead of focusing on the difference between predicted and actual outcomes, the key idea of TD learning is to make the difference between temporally successive predictions small. Specifically, the TD learning approach learns the value function by using the Bellman equation to bootstrap from the current estimated value function. To date, many algorithms have been proposed within the family of TD learning (Dann et al., 2014). However, most of these methods suffer from either unstable convergence (e.g., TD(λ) (Sutton, 1988) for off-policy training) or high computational complexity (e.g., least-squares temporal difference (LSTD) (Boyan, 2002; Bradtke & Barto, 1996)) when training with massive features. The limitation of these early attempts is largely due to the fact that they do not leverage a gradient oracle in policy evaluation. Thus, in recent years, gradient-based policy evaluation algorithms have become increasingly prevalent. However, designing an efficient gradient-based policy evaluation algorithm is a non-trivial task. On one hand, as an RL task becomes more sophisticated, it is more appropriate to utilize nonlinear function approximation (e.g., a deep neural network (DNN)) to model the value function. However, when working with nonlinear DNN models, the convergence of conventional single-timescale TD algorithms may not be guaranteed (Tsitsiklis & Van Roy, 1997). To address this issue, some convergent two-timescale algorithms (Bhatnagar et al., 2009; Chung et al., 2018) have been proposed at the expense of higher implementation complexity. On the other hand, modern policy evaluation tasks could involve a large amount of state transition data. To perform policy evaluation, algorithms typically need to calculate full gradients that require all training data (e.g., gradient temporal difference (GTD) (Sutton et al., 2008) and TD with gradient correction (TDC) (Sutton et al., 2009b)), which entails a high sample complexity.
So far, existing works on PE either focus on linear approximation (GTD2 (Sutton et al., 2009b), PDBG (Du et al., 2017), SVRG (Du et al., 2017), SAGA (Du et al., 2017)) or suffer from slow convergence (STSG (Qiu et al., 2020), VR-STSG (Qiu et al., 2020), nPD-VR (Wai et al., 2019)) (see detailed discussions in Section 2). In light of the above limitations, in this paper, we ask the following question: Could we develop an efficient single-timescale gradient-based algorithm for policy evaluation based on nonlinear function approximation? We give an affirmative answer to this question. Specifically, we propose an efficient gradient-based variance-reduced primal-dual algorithm (VRPD) to tackle the policy evaluation problem with nonlinear function approximation, which we recast as a minimax optimization problem. Our VRPD algorithm admits a simple and elegant single-timescale algorithmic structure. We then further enhance VRPD by proposing VRPD+, which uses adaptive batch sizes to relax the periodic full gradient evaluation and further reduce the sample complexity. The main contribution of this paper is that our proposed algorithms achieve an $O(1/K)$ convergence rate ($K$ is the number of iterations) with constant step-sizes for policy evaluation with nonlinear function approximation, which is the best-known result in the literature thus far. Our main results are highlighted as follows:
• By utilizing a variance reduction technique, our VRPD algorithm allows constant step-sizes and enjoys a low sample complexity. We show that, under mild assumptions and appropriate parameter choices, VRPD achieves an $O(1/K)$ convergence rate to a first-order stationary point of a class of nonconvex-strongly-concave (NCSC) minimax problems, which is the best-known result in the literature.
To achieve this result, our convergence rate analysis introduces new proof techniques, resolves an open question, and clarifies an ambiguity in the state-of-the-art convergence analysis of VR-based policy evaluation methods (see the 2nd paragraph in Section 2.1 for more discussions).
• VRPD+ significantly improves the sample complexity of the VRPD algorithm for policy evaluation with massive datasets. Our VRPD+ (adaptive-batch VRPD) algorithm incorporates historical information along the optimization path, but does not involve backtracking or condition verification. We show that VRPD+ significantly reduces the number of samples and the computational load of gradient evaluations, thanks to our proposed adaptive batch-size technique, which avoids full gradient evaluation.
• Our extensive experimental results also confirm that our algorithms outperform the state-of-the-art gradient-based policy evaluation algorithms, and that VRPD+ can further reduce the sample complexity compared to VRPD.
It is worth noting that, although the focus of our work is on RL policy evaluation, our algorithmic design and proof techniques contribute to the area of minimax optimization and could be of independent theoretical interest.

2. RELATED WORK

1) TD Learning with Function Approximation for Policy Evaluation: TD learning with function approximation plays a vital role in policy evaluation for RL. The key idea of TD learning is to [...] 2017) that policy evaluation with linear function approximation by TD(0) can be formulated as a strongly convex-concave or convex-concave problem, and can be solved by a primal-dual method with a linear convergence rate. However, the linearity assumption does not hold in a wide range of policy evaluation problems with nonlinear models. TD learning with nonlinear (smooth) function approximation is far more complex. Maei et al. (2009) were among the first to propose a general framework for minimizing the generalized mean-squared projected Bellman error (MSPBE) with smooth and nonlinear value functions. However, they adopted two-timescale step-sizes and obtained only slow convergence. Other TD methods with nonlinear function approximation for policy evaluation include (Wang et al., 2017; 2016). Qiu et al. (2020) also investigated nonlinear TD learning and proposed two single-timescale first-order stochastic algorithms. However, the convergence rates of their STSG and VR-STSG are $O(1/K^{1/4})$ and $O(1/K^{1/3})$, respectively, while our VRPD algorithm achieves a much faster $O(1/K)$ convergence rate. In policy evaluation with nonlinear function approximation, the state-of-the-art and most closely related work to ours is (Wai et al., 2019), which showed that minimizing the generalized MSPBE is equivalent to solving a non-convex-strongly-concave (NCSC) minimax optimization problem via Fenchel duality. However, their best convergence results only hold when the step-size is $O(1/M)$, where $M$ is the size of the dataset. This is problematic for modern RL problems with a large state-action transition dataset.
More importantly, although their convergence theorem appears to have a $1/K$ factor ($K$ being the total number of iterations), their convergence rate bound is in the form of $\frac{F^{(K)} + \text{Constant}_1}{K \cdot \text{Constant}_2}$ (cf. Theorem 1, Eq. (26) in Wai et al. (2019)). Notably, the $F^{(K)}$ term in the numerator in Eq. (26) inherently depends on the primal and dual values $\theta^{(K)}$ and $\omega^{(K)}$ in the $K$-th iteration, respectively. It is unclear whether $\omega^{(K)}$ can be bounded in (Wai et al., 2019), which leads to an ambiguity in guaranteeing an $O(1/K)$ convergence rate. Thus, whether an $O(1/K)$ convergence rate is achievable in single-timescale policy evaluation with nonlinear function approximation and constant step-sizes has remained an open question. The key contribution and novelty of this paper is that we resolve this open question by proposing two new algorithms, both achieving an $O(1/K)$ convergence rate. To establish this result, we propose a new convergence metric (cf. Eq. (9) in Section 4.1), which necessitates new proof techniques and analysis. For easy comparison, we summarize our algorithms and the related works in Table 1.

2) Relations with NCSC Minimax Optimization: Although the focus of our paper is on RL policy evaluation, our algorithmic techniques are also related to the area of NCSC minimax optimization due to the primal-dual MSPBE formulation (cf. Eq. (2) in Section 3). Early attempts in (Nouiehed et al., 2019; Lin et al., 2020b) developed gradient descent-ascent algorithms to solve NCSC minimax problems. However, these methods suffer from a high sample complexity and a slow convergence rate. To overcome this limitation, the variance-reduction algorithm SREDA (Luo et al., 2020) was proposed for solving NCSC minimax problems, and it shares some similarity with our work. Later, Xu et al. (2020a) enhanced SREDA to allow bigger step-sizes.
However, our algorithms still differ from SREDA in the following key aspects: (i) Our algorithms are single-timescale algorithms, which are much easier to implement. In comparison, SREDA is a two-timescale algorithm that requires solving an inner concave maximization subproblem. Thus, to a certain extent, SREDA can be viewed as having a triple-loop structure, and hence its computational complexity is higher than ours. (ii) In the initialization stage, SREDA uses PiSARAH, a subroutine that helps SREDA achieve the desired accuracy at the initialization step and can be seen as an additional step for solving an inner concave maximization subproblem. Thus, SREDA has a higher computation cost than our algorithms. (iii) SREDA has far more parameters than our algorithms, and it requires knowledge of the condition number to set the algorithm's parameters for good convergence performance. By contrast, our algorithms only require the step-sizes $\alpha$ and $\beta$ to be sufficiently small, which is easier to tune in practice. (iv) SREDA does not provide an explicit convergence rate (nor is it clear from the proof what the convergence rate is). Yet, we show that our VRPD in theory has a lower sample complexity than that of SREDA. Another related work in terms of NCSC minimax optimization is (Zhang et al., 2021), which also provided sample complexity upper and lower bounds. However, there remains a gap between the sample complexity lower and upper bounds in (Zhang et al., 2021). By contrast, the sample complexity of our VRPD algorithm matches the lower bound $O(M + \sqrt{M}\epsilon^{-2})$ in (Zhang et al., 2021), which is the first such result in the literature. Furthermore, the algorithm in (Zhang et al., 2021) contains an inner minimax subproblem (cf. Line 6 of Algorithm 1 in Zhang et al. (2021)). Solving such a subproblem in the inner loop incurs high computational costs.
For this reason, the algorithm in (Zhang et al., 2021) had to settle for an inexact solution, which hurts the convergence performance in practice. In contrast, our algorithm does not have such a limitation.

3. PRELIMINARIES AND PROBLEM STATEMENT

We start by introducing the necessary background on reinforcement learning, with a focus on the policy evaluation problem based on nonlinear function approximation.

1) Policy Evaluation with Nonlinear Approximation: RL problems are formulated using the Markov decision process (MDP) framework defined by a five-tuple $\{\mathcal{S}, \mathcal{A}, P, \gamma, \mathcal{R}\}$, where $\mathcal{S}$ denotes the state space and $\mathcal{A}$ is the action space; $P: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ represents the transition function, which specifies the probability of one state transitioning to another after taking an action; $\mathcal{R}$ denotes the space of the reward received upon taking an action $a \in \mathcal{A}$ under state $s \in \mathcal{S}$ (in this paper, we assume that the state and action spaces are finite, but the numbers of states and actions could be large); and $\gamma \in [0, 1)$ is a time-discount factor. For RL problems over an infinite discrete-time horizon $\{t \in \mathbb{N}\}$, the learning agent executes an action $a_t$ according to the state $s_t$ and some policy $\pi: \mathcal{S} \to \mathcal{A}$. The system then transitions into a new random state $s_{t+1}$ in the next time-slot. Also, the agent receives a random reward $R^\pi(s_t, a_t)$. The trajectory generated by a policy $\pi$ is a sequence of state-action pairs denoted as $\{s_1, a_1, s_2, a_2, \ldots\}$. The goal of the agent is to learn an optimal policy $\pi^*$ to maximize the long-term discounted total reward. Specifically, for a policy $\pi$ (which could be a randomized policy), the expected reward received by the agent at state $s_t$ in any given time-slot can be computed as $R^\pi(s_t) = \mathbb{E}_{a \sim \pi(\cdot|s_t)}[R^\pi(s_t, a)]$. The value function $V^\pi(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^\pi(s_t) \mid s_0, \pi\right]$ indicates the long-term discounted reward of policy $\pi$ over an infinite horizon with the initial state $s_0 \in \mathcal{S}$. Also, the Bellman equation implies that $V^\pi(\cdot)$ satisfies $V(s) = T^\pi V(s)$, where $T^\pi f(s) := \mathbb{E}[R^\pi(s) + \gamma f(s') \mid a \sim \pi(\cdot|s), s' \sim P(\cdot|s, a)]$ denotes the Bellman operator. In RL, the agent's goal is to determine an optimal policy $\pi^*$ that maximizes the value function $V^\pi(s)$ from any initial state $s$.
However, the first obstacle in solving RL problems stems from evaluating $V^\pi(\cdot)$ for a given $\pi$, since $P(\cdot|s, a)$ is unknown. Moreover, it is often infeasible to store $V^\pi(s)$ since the state space $\mathcal{S}$ could be large. To address these challenges, one popular approach in RL is to approximate $V^\pi(\cdot)$ using a family of parametric and smooth functions in the form of $V^\pi(\cdot) \approx V_{\theta^\pi}(\cdot)$, where $\theta^\pi \in \mathbb{R}^d$ is a $d$-dimensional parameter vector. Here, $\Theta$ is a compact subspace. For notational simplicity, we will omit all superscripts "$\pi$" whenever the policy $\pi$ is clear from the context. In this paper, we focus on nonlinear function approximation, i.e., $V_\theta(\cdot): \mathcal{S} \to \mathbb{R}$ is a nonlinear function with respect to (w.r.t.) $\theta$. For example, $V_\theta(\cdot)$ could be based on a $\theta$-parameterized nonlinear DNN. We assume that the gradient and Hessian of $V_\theta(\cdot)$ exist and are denoted as $g_\theta(s) := \nabla_\theta V_\theta(s) \in \mathbb{R}^d$ and $H_\theta(s) := \nabla_\theta^2 V_\theta(s) \in \mathbb{R}^{d \times d}$. Our goal is to find the optimal parameter $\theta^* \in \mathbb{R}^d$ that minimizes the error between $V_{\theta^*}(\cdot)$ and $V(\cdot)$. This problem can be formulated as minimizing the mean-squared projected Bellman error (MSPBE) of the value function as follows (Liu et al., 2018):

$$\mathrm{MSPBE}(\theta) := \frac{1}{2}\left\| \mathbb{E}_{s \sim D^\pi(\cdot)}\left[\left(T^\pi V_\theta(s) - V_\theta(s)\right)\nabla_\theta V_\theta(s)\right] \right\|^2_{D^{-1}} = \max_{\omega \in \mathbb{R}^d}\left( -\frac{1}{2}\mathbb{E}_{s \sim D^\pi(\cdot)}\left[(\omega^\top g_\theta(s))^2\right] + \left\langle \omega, \mathbb{E}_{s \sim D^\pi(\cdot)}\left[\left(T^\pi V_\theta(s) - V_\theta(s)\right) g_\theta(s)\right] \right\rangle \right), \tag{1}$$

where $D^\pi(\cdot)$ is the stationary distribution under policy $\pi$ and $D = \mathbb{E}_{s \sim D^\pi}[g_\theta(s) g_\theta(s)^\top] \in \mathbb{R}^{d \times d}$.

2) Primal-Dual Optimization for MSPBE: It is shown in (Liu et al., 2018) (cf. Proposition 1) that minimizing $\mathrm{MSPBE}(\theta)$ in (1) is equivalent to solving a primal-dual minimax optimization problem:

$$\min_{\theta \in \mathbb{R}^d} \max_{\omega \in \mathbb{R}^d} L(\theta, \omega), \quad \text{where } L(\theta, \omega) := \left\langle \omega, \mathbb{E}_{s \sim D^\pi(\cdot)}\left[\left(T^\pi V_\theta(s) - V_\theta(s)\right) g_\theta(s)\right] \right\rangle - \frac{1}{2}\mathbb{E}_{s \sim D^\pi(\cdot)}\left[(\omega^\top g_\theta(s))^2\right]. \tag{2}$$
Since the distribution $D^\pi(\cdot)$ is unknown and the expectation cannot be evaluated directly, one often considers the following empirical minimax problem, which replaces the expectation in $L(\theta, \omega)$ with a finite-sample average based on an $M$-step trajectory:

$$\min_{\theta \in \mathbb{R}^d} \max_{\omega \in \mathbb{R}^d} L(\theta, \omega) = \min_{\theta \in \mathbb{R}^d} \max_{\omega \in \mathbb{R}^d} \frac{1}{M}\sum_{i=1}^{M} L_i(\theta, \omega), \quad \text{where } L_i(\theta, \omega) := \left\langle \omega, \left[R(s_i, a_i, s_{i+1}) + \gamma V_\theta(s_{i+1}) - V_\theta(s_i)\right] \times g_\theta(s_i) \right\rangle - \frac{1}{2}(\omega^\top g_\theta(s_i))^2. \tag{3}$$

The rest of this paper is devoted to solving the above empirical minimax problem for MSPBE.
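To make the per-sample objective in (3) concrete, the following minimal sketch evaluates $L_i(\theta, \omega)$ and its dual gradient for a toy two-parameter value function. The model $V_\theta(s) = \theta_1 \tanh(\theta_2 s)$, the scalar state, the discount value, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

GAMMA = 0.9  # discount factor (illustrative value, not from the paper)


def value(theta, s):
    """Toy nonlinear value function V_theta(s) = theta[0] * tanh(theta[1] * s)."""
    return theta[0] * math.tanh(theta[1] * s)


def value_grad(theta, s):
    """g_theta(s): gradient of V_theta(s) with respect to theta."""
    t = math.tanh(theta[1] * s)
    return [t, theta[0] * s * (1.0 - t * t)]


def td_error(theta, transition):
    """R(s_i, a_i, s_{i+1}) + gamma * V(s_{i+1}) - V(s_i) for one transition."""
    s, r, s_next = transition
    return r + GAMMA * value(theta, s_next) - value(theta, s)


def sample_loss(theta, omega, transition):
    """Per-sample objective L_i(theta, omega) from Eq. (3)."""
    g = value_grad(theta, transition[0])
    og = sum(w * gj for w, gj in zip(omega, g))  # omega^T g_theta(s_i)
    return td_error(theta, transition) * og - 0.5 * og * og


def sample_grad_omega(theta, omega, transition):
    """Gradient of L_i with respect to omega: (td_error - omega^T g) * g."""
    g = value_grad(theta, transition[0])
    og = sum(w * gj for w, gj in zip(omega, g))
    delta = td_error(theta, transition)
    return [(delta - og) * gj for gj in g]


def empirical_loss(theta, omega, transitions):
    """Finite-sum objective L(theta, omega) = (1/M) * sum_i L_i(theta, omega)."""
    return sum(sample_loss(theta, omega, tr) for tr in transitions) / len(transitions)
```

Each transition is a tuple `(s, r, s_next)`. Because $L_i$ is quadratic in $\omega$, the dual gradient has the simple closed form used in `sample_grad_omega`; the primal gradient would additionally involve the Hessian $H_\theta(s)$ and is omitted here for brevity.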

4. SOLUTION APPROACH

As mentioned in Section 3, based on an $M$-step trajectory $\{s_1, a_1, \ldots, s_M, a_M, s_{M+1}\}$ generated by some policy $\pi$, our goal is to solve the empirical primal-dual finite-sum optimization problem:

$$\min_{\theta \in \mathbb{R}^d} \max_{\omega \in \mathcal{W}} \frac{1}{M}\sum_{i=1}^{M} L_i(\theta, \omega) = \min_{\theta \in \mathbb{R}^d} \max_{\omega \in \mathcal{W}} L(\theta, \omega), \tag{4}$$

where $\mathcal{W}$ is assumed to be a convex constraint set (Problem (4) becomes Problem (2) when $\mathcal{W} = \mathbb{R}^d$). In the Appendix, we also discuss the min-max problem with $\theta \in \Theta$, where $\Theta$ is a convex constraint set; see Appendix 12 for details. Note that Problem (4) could be non-convex (e.g., with DNN-based nonlinear approximation). Let $J(\theta) := \max_{\omega \in \mathcal{W}} L(\theta, \omega)$. Then, we can equivalently rewrite Problem (4) as follows:

$$\min_{\theta \in \mathbb{R}^d} \max_{\omega \in \mathcal{W}} L(\theta, \omega) = \min_{\theta \in \mathbb{R}^d} J(\theta).$$

Note from (3) that $L(\theta, \omega)$ is strongly concave w.r.t. $\omega$, which guarantees the existence and uniqueness of the solution to the problem $\max_{\omega \in \mathcal{W}} L(\theta, \omega)$, $\forall \theta \in \mathbb{R}^d$. Then, given $\theta \in \mathbb{R}^d$, we define the following notation: $\omega^*(\theta) := \arg\max_{\omega \in \mathcal{W}} L(\theta, \omega)$. To simplify the notation, we use $\omega^*$ to denote $\omega^*(\theta)$. Thus, $J(\theta)$ can be further written as $J(\theta) = L(\theta, \omega^*) = \max_{\omega \in \mathcal{W}} L(\theta, \omega)$. The function $J(\theta)$ can be viewed as a finite empirical version of MSPBE. We aim to minimize $J(\theta)$ by finding a stationary point of $L(\theta, \omega)$. Note that if $D$ in Eq. (1) is positive definite, Problem (4) is strongly concave in $\omega$, but non-convex in $\theta$ in general due to the non-convexity of the function $V_\theta$. Thus, the stated primal-dual objective function is an NCSC optimization problem. In this paper, we make the following assumptions:

Assumption 1 ($\mu$-Strong Concavity). The differentiable function $L(\theta, \omega)$ is $\mu$-strongly concave in $\omega$ if $L(\theta, \omega) \le L(\theta, \omega') + \nabla_\omega L(\theta, \omega')^\top(\omega - \omega') - \frac{\mu}{2}\|\omega - \omega'\|^2$, $\forall \omega, \omega' \in \mathbb{R}^d$, for some $\mu > 0$ and any fixed $\theta \in \mathbb{R}^d$. The above condition is equivalent to $\|\nabla_\omega L(\theta, \omega) - \nabla_\omega L(\theta, \omega')\| \ge \mu\|\omega - \omega'\|$, $\forall \omega, \omega' \in \mathbb{R}^d$. Similar proofs can be found in Lemmas 2 and 3 in Zhou (2018).

Assumption 2 ($L_f$-Smoothness). For $i = 1, 2, \ldots, M$, each $L_i$ is $L_f$-smooth in both $\theta$ and $\omega$. That is, for all $\theta, \theta' \in \mathbb{R}^d$ and $\omega, \omega' \in \mathbb{R}^d$, there exists a constant $L_f > 0$ such that $\|\nabla L_i(\theta, \omega) - \nabla L_i(\theta', \omega')\| \le L_f\left(\|\theta - \theta'\| + \|\omega - \omega'\|\right)$.

Algorithm 1 The Variance-Reduced Primal-Dual Stochastic Gradient Method (VRPD).
Input: An $M$-step trajectory of the state-action pairs $\{s_1, a_1, s_2, a_2, \ldots, s_M, a_M, s_{M+1}\}$ generated from a given policy; step sizes $\alpha, \beta \ge 0$; initialization points $\theta^0 \in \mathbb{R}^d$, $\omega^0 \in \mathcal{W}$.
Output: $(\theta^{(\tilde{K})}, \omega^{(\tilde{K})})$, where $\tilde{K}$ is independently and uniformly picked from $\{1, \ldots, K\}$;
1: for $k = 0, 1, 2, \ldots, K-1$ do
2: If $\mathrm{mod}(k, q) = 0$, compute full gradients $G_\theta^{(k)}$, $G_\omega^{(k)}$ as in Eq. (6).
3: Otherwise, select $S$ samples independently and uniformly from $[M]$, and compute gradients as in Eq. (7).
4: Perform the primal-dual updates to obtain the next iterate $\theta^{(k+1)}$, $\omega^{(k+1)}$ as in Eq. (8).
5: end for

Assumption 3 (Bounded Variance). There exists a constant $\sigma > 0$ such that for all $\theta \in \mathbb{R}^d$, $\omega \in \mathbb{R}^d$, $\frac{1}{M}\sum_{i=1}^{M}\|\nabla_\theta L_i(\theta, \omega) - \nabla_\theta L(\theta, \omega)\|^2 \le \sigma^2$ and $\frac{1}{M}\sum_{i=1}^{M}\|\nabla_\omega L_i(\theta, \omega) - \nabla_\omega L(\theta, \omega)\|^2 \le \sigma^2$.

Among the above assumptions, Assumption 1 is satisfied if the number of samples $M$ is sufficiently large and the matrix $D$ is positive definite. To see this, note that $\mu = \lambda_{\min}(D) > 0$, where $D = \mathbb{E}_s[\nabla_\theta V_\theta(s)\nabla_\theta V_\theta(s)^\top] \in \mathbb{R}^{d \times d}$, and $D$ tends to be full-rank as $M$ increases. Thus, once we find a $\mu > 0$ for a sufficiently large $M$, this $\mu$ is independent of $M$ as $M$ continues to increase. Assumption 2 is standard in the optimization literature. Assumption 3 is also commonly adopted for proving convergence results of SGD- and VR-based algorithms, or algorithms that draw a mini-batch of samples instead of all samples. Assumption 3 is guaranteed to hold under the compact set condition and is common for stochastic approximation algorithms for minimax optimization (Qiu et al., 2020; Lin et al., 2020a). Assumptions 1-3 are also general assumptions often used in temporal difference (TD) problems (see, e.g., (Qiu et al., 2020; Wai et al., 2019)). With these assumptions, we are now in a position to present our algorithms and their convergence performance results.
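As a worked illustration of the inner maximization that defines $J(\theta)$, consider the unconstrained case $\mathcal{W} = \mathbb{R}^d$. The shorthand $\delta_i(\theta)$, $b(\theta)$, and $\hat{D}(\theta)$ below is introduced here only for illustration and is not notation from the paper; the derivation assumes that $\hat{D}(\theta)$ is positive definite.

$$
\begin{aligned}
\delta_i(\theta) &:= R(s_i, a_i, s_{i+1}) + \gamma V_\theta(s_{i+1}) - V_\theta(s_i), \\
b(\theta) &:= \frac{1}{M}\sum_{i=1}^{M} \delta_i(\theta)\, g_\theta(s_i), \qquad \hat{D}(\theta) := \frac{1}{M}\sum_{i=1}^{M} g_\theta(s_i)\, g_\theta(s_i)^\top, \\
L(\theta, \omega) &= \langle \omega, b(\theta) \rangle - \tfrac{1}{2}\, \omega^\top \hat{D}(\theta)\, \omega, \\
\nabla_\omega L(\theta, \omega) &= b(\theta) - \hat{D}(\theta)\, \omega, \qquad \nabla_\omega^2 L(\theta, \omega) = -\hat{D}(\theta), \\
\omega^*(\theta) &= \hat{D}(\theta)^{-1} b(\theta), \qquad J(\theta) = \tfrac{1}{2}\, b(\theta)^\top \hat{D}(\theta)^{-1} b(\theta).
\end{aligned}
$$

Under these assumptions, for a fixed $\theta$ the strong-concavity modulus in $\omega$ equals $\lambda_{\min}(\hat{D}(\theta))$, and $J(\theta)$ is the empirical counterpart of the $D^{-1}$-weighted norm in (1).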

4.1. THE VARIANCE-REDUCED PRIMAL-DUAL METHOD

In this section, we first present the variance-reduced primal-dual (VRPD) algorithm for solving policy evaluation problems, followed by the theoretical convergence results. Due to space limitation, we provide a proof sketch in the main text and relegate the proof to the supplementary material.

1) Algorithm Description:

The full description of VRPD is given in Algorithm 1. In VRPD, every $q$ iterations, the algorithm calculates the full gradients as follows:

$$G_\theta^{(k)} = \frac{1}{M}\sum_{i \in [M]} \nabla_\theta L_i(\theta^{(k)}, \omega^{(k)}); \qquad G_\omega^{(k)} = \frac{1}{M}\sum_{i \in [M]} \nabla_\omega L_i(\theta^{(k)}, \omega^{(k)}). \tag{6}$$

In all other iterations, VRPD selects a batch of samples $S$ and computes variance-reduced gradient estimators as:

$$G_\theta^{(k)} = \frac{1}{|S|}\sum_{i \in S}\left[\nabla_\theta L_i(\theta^{(k)}, \omega^{(k)}) - \nabla_\theta L_i(\theta^{(k-1)}, \omega^{(k-1)})\right] + G_\theta^{(k-1)}; \tag{7a}$$

$$G_\omega^{(k)} = \frac{1}{|S|}\sum_{i \in S}\left[\nabla_\omega L_i(\theta^{(k)}, \omega^{(k)}) - \nabla_\omega L_i(\theta^{(k-1)}, \omega^{(k-1)})\right] + G_\omega^{(k-1)}. \tag{7b}$$

The estimators in (7) are constructed iteratively based on the previous update information $\nabla_\theta L_i(\theta^{(k-1)}, \omega^{(k-1)})$ (resp. $\nabla_\omega L_i(\theta^{(k-1)}, \omega^{(k-1)})$) and $G_\theta^{(k-1)}$ (resp. $G_\omega^{(k-1)}$). VRPD updates the primal and dual variables as follows:

$$\theta^{(k+1)} = \theta^{(k)} - \beta G_\theta^{(k)}; \tag{8a}$$

$$\omega^{(k+1)} = P_{\mathcal{W}}(\omega^{(k)} + \alpha G_\omega^{(k)}) = \arg\min_{\tilde{\omega} \in \mathcal{W}} \left\|\tilde{\omega} - (\omega^{(k)} + \alpha G_\omega^{(k)})\right\|^2, \tag{8b}$$

where the parameters $\alpha$ and $\beta$ are constant learning rates for the dual and primal updates, respectively.

2) Convergence Performance: In this paper, we propose a new metric for convergence analysis:

$$\mathfrak{M}^{(k)} := \|\nabla J(\theta^{(k)})\|^2 + 2\|\omega^{(k)} - \omega^*(\theta^{(k)})\|^2. \tag{9}$$

The first term in (9) measures the convergence of the primal variable $\theta$. As is common in nonconvex optimization analysis, $\|\nabla J(\theta)\|^2 = 0$ indicates that $\theta$ is a first-order stationary point (FOSP) of Problem (4). The second term in (9) measures the convergence of $\omega^{(k)}$ to the unique maximizer $\omega^*(\theta^{(k)})$ of $L(\theta^{(k)}, \cdot)$. Note that if Problem (4) is unconstrained in the dual (i.e., $\omega \in \mathbb{R}^d$), it follows from Assumption 2 and $\|\nabla_\omega L(\theta^{(k)}, \omega^*(\theta^{(k)}))\|^2 = 0$ that $\mathfrak{M}^{(k)} \ge \|\nabla J(\theta^{(k)})\|^2 + \frac{2}{L_f^2}\|\nabla_\omega L(\theta^{(k)}, \omega^{(k)})\|^2$. We now introduce the notion of approximate first-order stationary points. We say that a point $\{\theta, \omega\}$ is an $\epsilon$-stationary point of the function $L(\theta, \omega)$ if $\mathfrak{M} \le \epsilon$ is satisfied.

Remark. Several important remarks on the connections between our metric $\mathfrak{M}^{(k)}$ and the conventional convergence metrics in the literature are in order.
A conventional convergence metric in the literature for NCSC minimax optimization is $\|\nabla J(\theta^{(k)})\|^2$ (Lin et al., 2020a; Luo et al., 2020; Zhang et al., 2021), which is the first term of $\mathfrak{M}^{(k)}$ and measures the convergence of the primal variable $\theta$ under a given dual variable $\omega$. This is because $\|\nabla J(\theta)\|^2 = 0$ implies that $\theta$ is a FOSP. The novelty in our convergence metric is the second term in $\mathfrak{M}^{(k)}$, which measures the convergence of $\omega^{(k)}$ to the unique maximizer $\omega^*(\theta^{(k)})$ of $L(\theta^{(k)}, \cdot)$. Another conventional convergence metric in the literature on minimizing the empirical MSPBE problem is $\|\nabla_\theta L(\theta, \omega)\|^2 + \|\nabla_\omega L(\theta, \omega)\|^2$ (Tsitsiklis & Van Roy, 1997). Since the nonconvex-strongly-concave minimax optimization problem is unconstrained in the dual (i.e., $\omega \in \mathbb{R}^d$), it follows from the Lipschitz-smoothness in Assumption 2 and $\|\nabla_\omega L(\theta^{(k)}, \omega^*(\theta^{(k)}))\|^2 = 0$ that $\|\omega^{(k)} - \omega^*(\theta^{(k)})\|^2 \ge \frac{1}{L_f^2}\|\nabla_\omega L(\theta^{(k)}, \omega^{(k)})\|^2$. Therefore, the second term in our $\mathfrak{M}^{(k)}$ (i.e., $2\|\omega^{(k)} - \omega^*(\theta^{(k)})\|^2$) is, up to the constant factor $2/L_f^2$, an upper bound of the second term in this conventional metric (i.e., $\|\nabla_\omega L(\theta, \omega)\|^2$). Thus, $2\|\omega^{(k)} - \omega^*(\theta^{(k)})\|^2$ is a stronger metric than $\|\nabla_\omega L(\theta, \omega)\|^2$ in the sense that an $O(1/K)$ convergence rate under $\mathfrak{M}^{(k)}$ implies an $O(1/K)$ convergence rate of the conventional metric, but the converse is not true. Moreover, the benefit of using $2\|\omega^{(k)} - \omega^*(\theta^{(k)})\|^2$ in our $\mathfrak{M}^{(k)}$ is that its special structure allows us to prove the $O(1/K)$ convergence, while the second term in the conventional metric fails to do so.

With our proposed convergence metric in (9), we have the following convergence result:

Theorem 1. Under Assumptions 1-3, choose step-sizes $\alpha \le \min\left\{\frac{1}{4L_f}, \frac{2\mu}{34L_f^2 + 2\mu^2}\right\}$ and $\beta \le \min\left\{\frac{1}{4L_f}, \frac{1}{2(L_f + L_f^2/\mu)}, \frac{\mu}{8\sqrt{17}L_f^2}, \frac{\mu^2\alpha}{8\sqrt{34}L_f^2}\right\}$. Let $q = \sqrt{M}$ and $S = \sqrt{M}$. Then it holds that:

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\mathfrak{M}^{(k)}] \le \frac{1}{K\min\{1, L_f^2\}}\left[\frac{16L_f^2}{\alpha\mu}C_2 + \frac{2}{\beta}C_1\right] = O\left(\frac{1}{K}\right),$$

where $C_1 := \mathbb{E}[J(\theta^{(0)})] - \mathbb{E}[J(\theta^{(*)})]$ and $C_2 := \mathbb{E}\|\omega^*(\theta^{(0)}) - \omega^{(0)}\|^2$.

Corollary 2. The overall stochastic sample complexity is $O(\sqrt{M}\kappa^3\epsilon^{-1} + M)$.
Note that $\kappa = L_f/\mu$ denotes the condition number.

Remark. Theorem 1 states that VRPD achieves an $O(1/K)$ convergence rate to an $\epsilon$-FOSP. The most challenging part in proving Theorem 1 stems from the fact that one needs to simultaneously evaluate the progress of the gradient descent in the primal domain and the gradient ascent in the dual domain of the minimax problem. Toward this end, the nPD-VR method in (Wai et al., 2019) employs $\|\nabla_\omega L(\theta^{(k)}, \omega^{(k)})\|^2$ in their metric to evaluate convergence. However, this approach yields a term $F^{(K)} := \mathbb{E}[L(\theta^{(0)}, \omega^{(0)}) - L(\theta^{(K)}, \omega^{(K)})]$ in their convergence upper bound, which is in the form of $O(F^{(K)}/K)$ (cf. Theorem 1, Eq. (26) in (Wai et al., 2019)). Since $F^{(K)}$ depends on $K$, it is unclear whether or not the nPD-VR method in (Wai et al., 2019) can achieve an $O(1/K)$ convergence rate. This unsatisfactory result motivates us to propose the new metric $\mathfrak{M}^{(k)}$ in Eq. (9) to evaluate the convergence of our VRPD algorithm. The first part of our convergence metric, $\|\nabla J(\theta^{(k)})\|^2$, measures the stationarity gap of the primal variable, while the second part, $2\|\omega^{(k)} - \omega^*(\theta^{(k)})\|^2$, measures the dual optimality gap. Consequently, we bound the per-iteration change in $J(\theta)$ instead of the function $L(\theta^{(k)}, \omega^{(k)})$. This helps us avoid the technical limitations of (Wai et al., 2019) and successfully establish the $O(1/K)$ convergence rate, hence resolving an open problem in this area.

Algorithm 2 Adaptive-batch VRPD method (VRPD+).
Input: A trajectory of the state-action pairs $\{s_1, a_1, s_2, a_2, \ldots, s_M, a_M, s_{M+1}\}$ generated from a given policy; step sizes $\alpha, \beta \ge 0$; initialization points $\theta^0 \in \Theta$, $\omega^0 \in \mathbb{R}^d$.
Output: $(\theta^{(\tilde{K})}, \omega^{(\tilde{K})})$, where $\tilde{K}$ is independently and uniformly picked from $\{1, \ldots, K\}$;
1: for $k = 0, 1, 2, \ldots, K-1$ do
2: If $\mathrm{mod}(k, q) = 0$, select $N_s$ indices independently and uniformly from $[M]$ as in Eq. (10) and calculate stochastic gradients as in Eq. (11);
3: Otherwise, select $S$ samples independently and uniformly from $[M]$ and compute gradients as in Eq. (7);
4: Perform the primal-dual updates as in Eq. (8).
5: end for

Remark. VRPD adopts a large $O(1)$ (i.e., constant) step-size compared to the $O(1/M)$ step-size of nPD-VR (Wai et al., 2019), where $M$ is the dataset size. This also induces a faster convergence. Also, VRPD's estimator uses fresher information from the previous iteration, while VR-STSG (Qiu et al., 2020) and nPD-VR (Wai et al., 2019) only use the information from the beginning of $q$-sized windows. Collectively, VRPD makes considerably larger progress than state-of-the-art algorithms (Qiu et al., 2020; Wai et al., 2019).
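The VRPD loop (periodic full gradients as in Eq. (6), recursive mini-batch estimators as in Eq. (7), and the primal-dual updates in Eq. (8)) can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it uses scalar $\theta$ and $\omega$, a hypothetical toy objective $L_i(\theta, \omega) = \omega(a_i \sin\theta - c_i) - \omega^2/2$ in place of the MSPBE objective, and illustrative step sizes that are not derived from the conditions of Theorem 1. All function and variable names are invented for this example.

```python
import math
import random


def vrpd(grad_theta, grad_omega, M, theta, omega, alpha, beta, K, project, seed=0):
    """VRPD loop for scalar theta and omega; returns the list of iterates."""
    rng = random.Random(seed)
    q = S = max(1, math.isqrt(M))  # q = S = sqrt(M), the choice made in Theorem 1
    g_th = g_om = 0.0
    theta_prev, omega_prev = theta, omega
    iterates = []
    for k in range(K):
        if k % q == 0:
            # Eq. (6): full gradients over all M samples.
            g_th = sum(grad_theta(i, theta, omega) for i in range(M)) / M
            g_om = sum(grad_omega(i, theta, omega) for i in range(M)) / M
        else:
            # Eq. (7): recursive variance-reduced estimators on a mini-batch.
            batch = rng.choices(range(M), k=S)
            g_th += sum(grad_theta(i, theta, omega) - grad_theta(i, theta_prev, omega_prev)
                        for i in batch) / S
            g_om += sum(grad_omega(i, theta, omega) - grad_omega(i, theta_prev, omega_prev)
                        for i in batch) / S
        theta_prev, omega_prev = theta, omega
        # Eq. (8): primal descent with step beta, projected dual ascent with step alpha.
        theta = theta - beta * g_th
        omega = project(omega + alpha * g_om)
        iterates.append((theta, omega))
    return iterates


# Hypothetical toy problem: L_i(theta, omega) = omega*(a_i*sin(theta) - c_i) - omega^2/2,
# which is nonconvex in theta and 1-strongly concave in omega.
data_rng = random.Random(1)
M = 100
a = [1.0 + 0.1 * data_rng.uniform(-1.0, 1.0) for _ in range(M)]
c = [0.3 + 0.1 * data_rng.uniform(-1.0, 1.0) for _ in range(M)]
a_bar, c_bar = sum(a) / M, sum(c) / M


def grad_theta(i, th, om):
    return om * a[i] * math.cos(th)


def grad_omega(i, th, om):
    return a[i] * math.sin(th) - c[i] - om


def project(om):
    return min(10.0, max(-10.0, om))  # projection onto W = [-10, 10]


def metric(th, om):
    """Metric (9) for the toy problem, where omega*(theta) = a_bar*sin(theta) - c_bar."""
    omega_star = a_bar * math.sin(th) - c_bar
    grad_J = omega_star * a_bar * math.cos(th)
    return grad_J ** 2 + 2.0 * (om - omega_star) ** 2


iterates = vrpd(grad_theta, grad_omega, M, theta=1.0, omega=0.0,
                alpha=0.1, beta=0.05, K=2000, project=project)
initial_metric = metric(1.0, 0.0)
final_metric = metric(*iterates[-1])
```

The function returns every iterate so that a caller can either pick one uniformly at random, as the output rule of Algorithm 1 prescribes, or track the metric over time. Keeping the previous iterate is what lets the mini-batch step reuse $G^{(k-1)}$ and correct it with gradient differences on the same samples.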

4.2. THE ADAPTIVE-BATCH VRPD METHOD (VRPD+)

Note that VRPD still requires full gradients every $q$ iterations, which may entail a high sample complexity. Upon closer observation, we note that accurate gradient estimation plays an important role only in the later stage of the convergence process. This motivates us to further lower the sample complexity of VRPD by using adaptive batch sizes. Toward this end, we propose an adaptive-batch VRPD method (VRPD+) to lower the sample complexity of the VRPD algorithm in Algorithm 1.

1) Algorithm Description: The full description of VRPD+ is given in Algorithm 2. In VRPD+, our key idea is to use the gradients calculated in the previous loop to adjust the batch size $N_s$ of the next loop. Specifically, VRPD+ chooses $N_s$ in the $k$-th iteration as:

$$N_s = \min\{c_\gamma \sigma^2 (\gamma^{(k)})^{-1}, c_\epsilon \sigma^2 \epsilon^{-1}, M\}, \tag{10}$$

where $c_\gamma, c_\epsilon > c$ for a certain constant $c$, $M$ denotes the size of the dataset, $\sigma^2$ is the variance bound, and $\gamma^{(k+1)} = \sum_{i=(n_k-1)q}^{k} \frac{\|G_\theta^{(i)}\|^2}{q}$ is the average squared norm of the stochastic gradients calculated in the previous iterations. In VRPD+, every $q$ iterations, we select $N_s$ samples independently and uniformly from $[M]$ and compute gradient estimators as follows:

$$G_\theta^{(k)} = \frac{1}{|N_s|}\sum_{i \in N_s} \nabla_\theta L_i(\theta^{(k)}, \omega^{(k)}); \qquad G_\omega^{(k)} = \frac{1}{|N_s|}\sum_{i \in N_s} \nabla_\omega L_i(\theta^{(k)}, \omega^{(k)}). \tag{11}$$

For other iterations, VRPD+ is exactly the same as VRPD. Next, we will theoretically show that such an adaptive batch-size scheme retains the same convergence rate, while achieving an improved sample complexity.

2) Convergence Performance: For VRPD+, we have the following convergence performance result:

Theorem 3. Under Assumptions 1-3, choose step-sizes $\alpha \le \min\left\{\frac{1}{4L_f}, \frac{2\mu}{34L_f^2 + 2\mu^2}\right\}$ and $\beta \le \min\left\{\frac{1}{4L_f}, \frac{1}{2(L_f + L_f^2/\mu)}, \frac{\mu}{8\sqrt{17}L_f^2}, \frac{\mu^2\alpha}{8\sqrt{34}L_f^2}\right\}$. Let $q = \sqrt{M}$, $S = \sqrt{M}$ and $c_\gamma \ge 288L_f^2/\mu^2 + 8$ in VRPD+, where $c_\gamma \ge c$ for some constant $c > 4K + \frac{68K}{\beta\mu^2}$.
With constants $C_1 := \mathbb{E}[J(\theta^{(0)})] - \mathbb{E}[J(\theta^{(*)})]$ and $C_2 := \mathbb{E}[\|\omega^*(\theta^{(0)}) - \omega^{(0)}\|^2]$, it holds that:

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\mathfrak{M}^{(k)}] \le \frac{1}{K\min\{1, L_f^2\}}\left[K \cdot \frac{\epsilon}{2} + \frac{16L_f^2}{\alpha\mu}C_2 + \frac{2}{\beta}C_1\right] = O\left(\frac{1}{K}\right) + \frac{\epsilon}{2}.$$

Corollary 4. The overall stochastic sample complexity is $O(\sqrt{M}\kappa^3\epsilon^{-1} + M)$, where $\kappa = L_f/\mu$ denotes the condition number.

Remark. From Theorem 3, it can be seen that VRPD+ achieves the same convergence rate as that of VRPD. Since VRPD+ uses the subsample set $N_s$ instead of a full gradient calculation, it achieves a much lower sample complexity compared to VRPD. Additionally, the convergence performance of VRPD+ is affected by the term $K \cdot \frac{\epsilon}{2}$, which is due to the use of the adaptive batch size in each outer loop of VRPD+. Also, it can be observed that the convergence rate is affected by the carefully chosen step-sizes $\alpha$ and $\beta$, because a step-size that is either too small or too large may have a negative impact on the convergence of the algorithm.

Remark. The proof of Theorem 3 follows an approach similar to that of Theorem 1. The key difference and most challenging part of proving Theorem 3 stem from the relaxation on $\|\nabla_\theta L(\theta^{(k)}, \omega^{(k)}) - G_\theta^{(k)}\|^2$ and $\|\nabla_\omega L(\theta^{(k)}, \omega^{(k)}) - G_\omega^{(k)}\|^2$. Thanks to the bounded variance in Assumption 3 and the selected $N_s$ in Eq. (10), we are able to derive outer-loop bounds for the primal and dual gaps, respectively. We refer readers to the Appendix for the complete proof.
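The batch-size rule in Eq. (10) can be illustrated with a short helper. This is a sketch under stated assumptions rather than the authors' code: the function name, the rounding up to an integer, the lower clamp at one sample, and the numerical values in the usage lines are all illustrative choices (in particular, the constants are not checked against the conditions of Theorem 3).

```python
import math


def adaptive_batch_size(grad_sq_norms, q, sigma2, eps, c_gamma, c_eps, M):
    """Batch size N_s from Eq. (10).

    grad_sq_norms holds ||G_theta^(i)||^2 for the iterations of the previous
    window; gamma is their sum divided by the window length q.
    """
    gamma = sum(grad_sq_norms) / q
    candidates = [c_eps * sigma2 / eps, float(M)]
    if gamma > 0.0:
        candidates.append(c_gamma * sigma2 / gamma)
    # Rounding up and clamping to [1, M] are assumptions made for this sketch.
    return max(1, min(M, math.ceil(min(candidates))))


# Early in training the gradient norms are large, so a small batch suffices.
early = adaptive_batch_size([4.0, 1.0], q=2, sigma2=1.0, eps=1e-3,
                            c_gamma=50.0, c_eps=32.0, M=10000)
# Near a stationary point the gradient norms shrink and the rule falls back to M.
late = adaptive_batch_size([1e-3, 1e-3], q=2, sigma2=1.0, eps=1e-3,
                           c_gamma=50.0, c_eps=32.0, M=10000)
```

With these illustrative inputs, the early-stage call selects 20 samples while the late-stage call selects the full dataset of 10000, which mirrors the observation that accurate gradient estimates matter mainly in the later stage of convergence.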

5. EXPERIMENTAL RESULTS

In this section, we conduct numerical experiments to verify our theoretical results. We compare our work with the basic stochastic gradient (SG) method (Lin et al., 2020b) and three state-of-the-art algorithms for PE: nPD-VR (Wai et al., 2019), STSG (Qiu et al., 2020) and VR-STSG (Qiu et al., 2020). Due to space limitations, we provide our detailed experiment settings in the Appendix.

Numerical Results: We set the constant learning rates $\alpha = 10^{-3}$, $\beta = 10^{-1}$, mini-batch size $q = \sqrt{M}$, constant $c = 32$ and solution accuracy $\epsilon = 10^{-3}$. First, we compare VRPD with nPD-VR, SG, STSG, and VR-STSG in terms of loss value and gradient norm on MountainCar-v0 and Cartpole-v0 in Figs. 1 and 2. We set the constraint $W = [0, 10]^n$ and initialize all algorithms at the same point, which is generated randomly from the normal distribution. We can see that VR-STSG and nPD-VR slowly converge after 40 epochs, while STSG and SG fail to converge. VRPD converges faster than all the other algorithms with the same step-size values. As for Cartpole-v0, we clearly see a trend of approaching zero loss with VRPD. These results are consistent with our theoretical result that one can use a relatively large step-size with VRPD, which leads to faster convergence. We also compare the sample complexity of VRPD and VRPD$^+$ on MountainCar-v0 and Cartpole-v0, and the results are shown in Figs. 3 and 4, respectively. We can see that VRPD$^+$ converges to the same level with far fewer samples than VRPD does. Next, we compare the mean squared error (MSE) between the ground truth value function and the estimated value function over 10 independent runs with linear approximation and nonlinear approximation. In Fig. 5, with the same number of parameters, nonlinear approximation always achieves a smaller MSE than linear approximation (Du et al., 2017). Further experiments on the performance of $J(\theta)$ are shown in the supplementary material.
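As an illustration of the MSE comparison over independent runs, the following Python sketch shows one way such a metric could be computed. It is only a sketch under stated assumptions: the value functions are assumed to be given as lists of values on a common finite set of evaluation states, the per-run MSEs are assumed to be aggregated by their mean and sample standard deviation, and the function names are hypothetical. The paper does not specify these details.

```python
from statistics import mean, stdev


def mse(v_true, v_est):
    """Mean squared error between two value functions evaluated on the
    same list of states."""
    if len(v_true) != len(v_est):
        raise ValueError("value functions must cover the same states")
    return sum((a - b) ** 2 for a, b in zip(v_true, v_est)) / len(v_true)


def mse_over_runs(v_true, estimates_per_run):
    """Mean and sample standard deviation of the MSE across independent
    runs, where each run supplies its own estimated value function."""
    per_run = [mse(v_true, est) for est in estimates_per_run]
    # Assumption: report zero spread when only one run is available.
    spread = stdev(per_run) if len(per_run) > 1 else 0.0
    return mean(per_run), spread
```

Calling `mse_over_runs` once with the estimates from a linear approximator and once with those from a nonlinear approximator, using the same ground truth values, would give the two quantities being compared.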

6. CONCLUSION

In this paper, we proposed and analyzed two algorithms, VRPD and VRPD$^+$, for policy evaluation with nonlinear function approximation. The VRPD algorithm is built on a simple single-timescale framework that uses variance reduction techniques. It allows the use of constant step-sizes and achieves an $O(1/K)$ convergence rate. The VRPD$^+$ algorithm improves on VRPD by further applying an adaptive batch size based on historical stochastic gradient information. Our experimental results confirmed our theoretical findings on convergence and sample complexity.



Figure 1: MountainCar-v0 environment.

Figure 5: MSE comparison with 10 trials.

Comparison of algorithms for solving policy evaluation. $M$ is the size of the dataset; $K$ is the total number of iterations.

