LINEAR CONVERGENCE OF NATURAL POLICY GRADIENT METHODS WITH LOG-LINEAR POLICIES

Abstract

We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as inexact versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and Õ(1/ε²) sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size.

1. INTRODUCTION

Policy gradient (PG) methods have emerged as a popular class of algorithms for reinforcement learning. Unlike classical methods based on (approximate) dynamic programming (e.g., Puterman, 1994; Sutton & Barto, 2018), PG methods directly update the policy and its parametrization along the gradient direction of the value function (e.g., Williams, 1992; Sutton et al., 2000; Baxter & Bartlett, 2001). An important variant of PG is the natural policy gradient (NPG) method (Kakade, 2001). NPG uses the Fisher information matrix of the policy distribution as a preconditioner to improve the policy gradient direction, similar to quasi-Newton methods in classical optimization. Variants of NPG with policy parametrization through deep neural networks were shown to have impressive empirical successes (Schulman et al., 2015; Lillicrap et al., 2016; Mnih et al., 2016; Schulman et al., 2017). Motivated by the success of NPG in practice, there is now a concerted effort to develop convergence theories for the NPG method. Neu et al. (2017) provide the first interpretation of NPG as a mirror descent (MD) method (Nemirovski & Yudin, 1983; Beck & Teboulle, 2003). By leveraging different techniques for analyzing MD, it has been established that NPG converges to the global optimum in the tabular case (Agarwal et al., 2021; Khodadadian et al., 2021b; Xiao, 2022) and in some more general settings (Shani et al., 2020; Vaswani et al., 2022; Grudzien et al., 2022). To obtain a fast linear convergence rate for NPG, several recent works consider regularized NPG methods, such as entropy-regularized NPG (Cen et al., 2021) and other convex regularized NPG methods (Lan, 2022; Zhan et al., 2021). By designing appropriate step sizes, Khodadadian et al. (2021b) and Xiao (2022) obtain linear convergence of NPG without regularization. However, all these linear convergence results are limited to the tabular setting with direct parametrization.
It remains unclear whether this same linear convergence rate can be established in function approximation settings. In this paper we provide an affirmative answer to this question for the log-linear policy class. Our approach is based on the framework of compatible function approximation (Sutton et al., 2000; Kakade, 2001), which was extensively developed by Agarwal et al. (2021). Using this framework, variants of NPG with log-linear policies can be written as policy mirror descent (PMD) methods with inexact evaluations of the advantage function or Q-function (giving rise to NPG or Q-NPG, respectively). Then, by extending a recent analysis of PMD (Xiao, 2022), we obtain a non-asymptotic linear convergence of both NPG and Q-NPG with log-linear policies. A distinctive feature of this approach is the use of a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other (strongly) convex regularization. The extensions are highly nontrivial and require quite different techniques. This linear convergence leads to Õ(1/ε²) sample complexities for both methods. In particular, our sample complexity analysis also fixes errors in previous work. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size. See Appendix A for a thorough review. In particular, Table 1 provides a complete overview of our results.

2. PRELIMINARIES ON MARKOV DECISION PROCESSES

We consider an MDP denoted as M = {S, A, P, c, γ}, where S is a finite state space, A is a finite action space, P : S × A → ∆(S) is a Markovian transition model with P(s′ | s, a) being the transition probability from state s to s′ under action a, c is a cost function with c(s, a) ∈ [0, 1] for all (s, a) ∈ S × A, and γ ∈ [0, 1) is a discount factor. Here we use cost instead of reward to better align with the minimization convention in the optimization literature. The agent's behavior is modeled as a stochastic policy π ∈ ∆(A)^{|S|}, where π_s ∈ ∆(A) is the probability distribution over actions A in state s ∈ S. At each time t, the agent takes an action a_t ∈ A given the current state s_t ∈ S, following the policy π, i.e., a_t ∼ π_{s_t}. Then the MDP transitions into the next state s_{t+1} with probability P(s_{t+1} | s_t, a_t) and the agent incurs the cost c_t = c(s_t, a_t). Thus, a policy induces a distribution over trajectories {s_t, a_t, c_t}_{t≥0}. In the infinite-horizon discounted setting, the cost function of π with an initial state s is defined as

V_s(π) := E_{a_t ∼ π_{s_t}, s_{t+1} ∼ P(·|s_t, a_t)} [ ∑_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s ].

Given an initial state distribution ρ ∈ ∆(S), the goal of the agent is to find a policy π that minimizes

V_ρ(π) := E_{s∼ρ}[V_s(π)] = ∑_{s∈S} ρ_s V_s(π) = ⟨V(π), ρ⟩.

A more granular characterization of the performance of a policy is the state-action cost function (Q-function). For any pair (s, a) ∈ S × A, it is defined as

Q_{s,a}(π) := E_{a_t ∼ π_{s_t}, s_{t+1} ∼ P(·|s_t, a_t)} [ ∑_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s, a_0 = a ].    (2)

Let Q_s(π) ∈ R^{|A|} denote the vector [Q_{s,a}(π)]_{a∈A}. Then we have V_s(π) = E_{a∼π_s}[Q_{s,a}(π)] = ⟨π_s, Q_s(π)⟩. The advantage function is a centered version of the Q-function: A_{s,a}(π) := Q_{s,a}(π) − V_s(π), which satisfies E_{a∼π_s}[A_{s,a}(π)] = 0 for all s ∈ S.

Visitation probabilities.
Given a starting state distribution ρ ∈ ∆(S), we define the state visitation distribution d^π(ρ) ∈ ∆(S), induced by a policy π, as

d^π_s(ρ) := (1 − γ) E_{s_0∼ρ} [ ∑_{t=0}^∞ γ^t Pr^π(s_t = s | s_0) ],

where Pr^π(s_t = s | s_0) is the probability that s_t is equal to s when following the trajectory generated by π starting from s_0. Intuitively, the state visitation distribution measures the probability of being at state s across the entire trajectory. We define the state-action visitation distribution d^π(ρ) ∈ ∆(S × A) as

d^π_{s,a}(ρ) := d^π_s(ρ) π_{s,a} = (1 − γ) E_{s_0∼ρ} [ ∑_{t=0}^∞ γ^t Pr^π(s_t = s, a_t = a | s_0) ].

In addition, we extend the definition of d^π(ρ) by specifying the initial state-action distribution ν, i.e.,

d̃^π_{s,a}(ν) := (1 − γ) E_{(s_0,a_0)∼ν} [ ∑_{t=0}^∞ γ^t Pr^π(s_t = s, a_t = a | s_0, a_0) ].

The difference in the last two definitions is that for the former, the initial action a_0 is sampled directly from π, whereas for the latter, it is prescribed by the initial state-action distribution ν. We use d̃ rather than d to better distinguish the case with ν from that with ρ. When they are clear from context, we omit the arguments ν and ρ throughout the paper to simplify the presentation. From these definitions, we have for all (s, a) ∈ S × A,

d^π_s ≥ (1 − γ)ρ_s,   d^π_{s,a} ≥ (1 − γ)ρ_s π_{s,a},   d̃^π_{s,a} ≥ (1 − γ)ν_{s,a}.    (5)

Policy parametrization. In practice, both the state and action spaces S and A can be large and some form of function approximation is needed to reduce the dimensions and make the computation feasible. In particular, the policy π is often parametrized as π(θ) with θ ∈ R^m, where m is much smaller than |S| and |A|. In this paper, we focus on the log-linear policy class. Specifically, we assume that for each state-action pair (s, a), there is a feature mapping φ_{s,a} ∈ R^m and the policy takes the form

π_{s,a}(θ) = exp(φ_{s,a}^⊤ θ) / ∑_{a′∈A} exp(φ_{s,a′}^⊤ θ).    (6)
This setting is important since it is the simplest instantiation of the widely-used neural policy parametrization. To simplify notation in the rest of the paper, we use the shorthand V_ρ(θ) for V_ρ(π(θ)) and similarly Q_{s,a}(θ) for Q_{s,a}(π(θ)), A_{s,a}(θ) for A_{s,a}(π(θ)), d^θ_s for d^{π(θ)}_s, d^θ_{s,a} for d^{π(θ)}_{s,a}, and d̃^θ_{s,a} for d̃^{π(θ)}_{s,a}.

Natural Policy Gradient (NPG) Method. Using the notation defined above, the parametrized policy optimization problem is to minimize the function V_ρ(θ) over θ ∈ R^m. The policy gradient is given by (see, e.g., Williams, 1992; Sutton et al., 2000)

∇_θ V_ρ(θ) = 1/(1−γ) E_{s∼d^θ, a∼π_s(θ)} [ Q_{s,a}(θ) ∇_θ log π_{s,a}(θ) ].

For parametrizations that are differentiable and satisfy ∑_{a∈A} π_{s,a}(θ) = 1, including the log-linear class defined in (6), we can replace Q_{s,a}(θ) by A_{s,a}(θ) in the above expression (Agarwal et al., 2021). The NPG method (Kakade, 2001) takes the form

θ^(k+1) = θ^(k) − η_k F_ρ(θ^(k))† ∇_θ V_ρ(θ^(k)),    (8)

where η_k > 0 is a scalar step size, F_ρ(θ) is the Fisher information matrix

F_ρ(θ) := E_{s∼d^θ, a∼π_s(θ)} [ ∇_θ log π_{s,a}(θ) ∇_θ log π_{s,a}(θ)^⊤ ],    (9)

and F_ρ(θ)† denotes the Moore-Penrose pseudoinverse of F_ρ(θ).
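To make the update (8) concrete, the following self-contained sketch (our own illustration, not an algorithm from the paper) computes one exact NPG direction F_ρ(θ)†∇_θV_ρ(θ) for a log-linear policy on a small finite MDP, assuming the state visitation weights and Q-values are given exactly; all function names are hypothetical.

```python
import numpy as np

def log_linear_policy(Phi, theta):
    """Softmax over linear scores: pi_{s,a}(theta) ∝ exp(phi_{s,a}^T theta).
    Phi: (S, A, m) feature tensor, theta: (m,) parameters."""
    logits = Phi @ theta                              # (S, A)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def npg_direction(Phi, theta, d_state, Q, gamma):
    """Exact NPG direction F^dagger grad V for the log-linear class,
    given exact state visitation weights d_state (S,) and Q-values (S, A).
    Returns the direction together with the Fisher matrix and the gradient."""
    pi = log_linear_policy(Phi, theta)
    # score function: grad_theta log pi_{s,a} = phi_{s,a} - E_{a'~pi_s}[phi_{s,a'}]
    score = Phi - np.einsum("sa,sam->sm", pi, Phi)[:, None, :]
    w_sa = d_state[:, None] * pi                      # joint visitation weights
    grad = np.einsum("sa,sam->m", w_sa * Q, score) / (1.0 - gamma)
    F = np.einsum("sa,sam,san->mn", w_sa, score, score)  # Fisher information
    return np.linalg.pinv(F) @ grad, F, grad
```

Since the gradient is a weighted combination of score vectors, it lies in the range of F, so the pseudoinverse step satisfies F d = ∇V exactly.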

3. NPG WITH COMPATIBLE FUNCTION APPROXIMATION

The parametrized value function V_ρ(θ) is non-convex in general (see, e.g., Agarwal et al., 2021). Despite the non-convexity, there is additional structure we can leverage to ensure convergence. Following Agarwal et al. (2021), we adopt the framework of compatible function approximation (Sutton et al., 2000; Kakade, 2001), which exploits the MDP structure and leads to a tight convergence rate analysis. For any w, θ ∈ R^m and state-action distribution ζ, define the compatible function approximation error as

L_A(w, θ, ζ) := E_{(s,a)∼ζ} [ (w^⊤ ∇_θ log π_{s,a}(θ) − A_{s,a}(θ))² ].    (10)

Kakade (2001) showed that the NPG update (8) is equivalent to (up to a constant scaling of η_k)

θ^(k+1) = θ^(k) − η_k w^(k),   w^(k) ∈ argmin_{w∈R^m} L_A(w, θ^(k), d^(k)),    (11)

where d^(k) is shorthand for the state-action visitation distribution d^{π(θ^(k))}(ρ) defined in (3). A derivation of (11) is provided in Appendix B (Lemma 1) for completeness. In other words, w^(k) is the solution to a regression problem that approximates A_{s,a}(θ^(k)) using ∇_θ log π_{s,a}(θ^(k)) as features. This is where the term "compatible function approximation error" comes from. For the log-linear policy class defined in (6), we have

∇_θ log π_{s,a}(θ) = φ̄_{s,a}(θ) := φ_{s,a} − ∑_{a′∈A} π_{s,a′}(θ) φ_{s,a′} = φ_{s,a} − E_{a′∼π_s(θ)}[φ_{s,a′}],    (12)

where the φ̄_{s,a}(θ) are called centered feature vectors. In practice, we cannot minimize L_A exactly; instead, a sample-based regression problem is solved to obtain an approximate solution w^(k). This leads to the following inexact NPG update rule:

θ^(k+1) = θ^(k) − η_k w^(k),   w^(k) ≈ argmin_w L_A(w, θ^(k), d^(k)).    (13)

The inexact NPG updates require unbiased estimates of A_{s,a}(θ); the corresponding sampling procedure is given in Algorithm 4, and a sample-based regression solver to minimize L_A is given in Algorithm 5 in the Appendix. Alternatively, as proposed by Agarwal et al.
(2021), we can define the compatible function approximation error as

L_Q(w, θ, ζ) := E_{(s,a)∼ζ} [ (w^⊤ φ_{s,a} − Q_{s,a}(θ))² ]    (14)

and use it to derive a variant of the inexact NPG update called Q-NPG:

θ^(k+1) = θ^(k) − η_k w^(k),   w^(k) ≈ argmin_w L_Q(w, θ^(k), d^(k)).    (15)

For Q-NPG, the sampling procedure for estimating Q_{s,a}(θ) is given in Algorithm 3 and a sample-based regression solver for w^(k) is proposed in Algorithm 6 in the Appendix. The sampling procedure and the regression solver of NPG are less efficient than those of Q-NPG. Indeed, the sampling procedure for A_{s,a}(θ) in Algorithm 4 not only estimates Q_{s,a}(θ) but also requires an additional estimate of V_s(θ), and thus doubles the number of samples compared to Algorithm 3. Furthermore, the stochastic gradient estimator of L_Q in Algorithm 6 only involves the feature map φ_{s,a} of a single action, whereas that of L_A in Algorithm 5 involves the centered feature map φ̄_{s,a}(θ) defined in (12), which requires a sum over the entire action space and is thus |A| times more expensive to evaluate. See Appendix C for more details. Following Agarwal et al. (2021), we consider slightly different variants of NPG and Q-NPG, where d^(k) in (13) and (15) is replaced by the more general state-action visitation distribution d̃^(k) = d̃^{π(θ^(k))}(ν) defined in (4) with ν ∈ ∆(S × A). The advantage of using d̃^(k) is that it allows better exploration than d^(k), as ν can be chosen to be independent of the policy π(θ^(k)). For example, it can be seen from (5) that the lower bound on d̃^π is independent of π, which is not the case for d^π. This property is crucial in the forthcoming convergence analysis.
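As a concrete illustration of the regression behind (15) (our own sketch; the paper's Algorithm 6 is a stochastic solver, whereas this solves the population problem in closed form for a finite state-action space), the exact minimizer of L_Q is a weighted least-squares solution of the normal equations; the function name is hypothetical.

```python
import numpy as np

def minimize_L_Q(features, targets, weights):
    """Exact minimizer of L_Q(w) = E_{(s,a)~zeta}[(w^T phi_{s,a} - Q_{s,a})^2].
    features: (N, m) stacked phi_{s,a}; targets: (N,) Q-values;
    weights: (N,) probabilities of the distribution zeta.
    Solves the normal equations Sigma w = E[phi * Q], using a pseudoinverse
    in case the weighted covariance Sigma is rank-deficient."""
    Sigma = features.T @ (weights[:, None] * features)   # (m, m)
    b = features.T @ (weights * targets)                 # (m,)
    return np.linalg.pinv(Sigma) @ b
```

When the Q-values happen to be exactly linear in the features, this recovers them with zero regression error, i.e., ε_approx = 0 in the later notation.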

3.1. FORMULATION AS INEXACT POLICY MIRROR DESCENT

Given an approximate solution w^(k) for minimizing L_Q(w, θ^(k), d̃^(k)), the Q-NPG update rule θ^(k+1) = θ^(k) − η_k w^(k), when plugged into the log-linear parametrization (6), results in a new policy

π^(k+1)_{s,a} = π^(k)_{s,a} exp(−η_k φ_{s,a}^⊤ w^(k)) / Z^(k)_s,   ∀(s, a) ∈ S × A,

where π^(k)_{s,a} is shorthand for π_{s,a}(θ^(k)) and Z^(k)_s is a normalization factor ensuring ∑_{a∈A} π^(k+1)_{s,a} = 1 for each s ∈ S. We note that the above π^(k+1) can also be obtained by a mirror descent update:

π^(k+1)_s = argmin_{p∈∆(A)} { η_k ⟨Φ_s w^(k), p⟩ + D(p, π^(k)_s) },   ∀s ∈ S,    (16)

where Φ_s ∈ R^{|A|×m} is a matrix with rows φ_{s,a}^⊤ ∈ R^{1×m} for a ∈ A, and D(p, q) denotes the Kullback-Leibler (KL) divergence between two distributions p, q ∈ ∆(A), i.e., D(p, q) := ∑_{a∈A} p_a log(p_a/q_a). A derivation of (16) is provided in Appendix B (Lemma 2) for completeness. If we replace Φ_s w^(k) in (16) by the vector [Q_{s,a}(π^(k))]_{a∈A} ∈ R^{|A|}, then it becomes the policy mirror descent (PMD) method in the tabular setting studied by, for example, Shani et al. (2020), Lan (2022) and Xiao (2022). In fact, the update rule (16) can be viewed as an inexact PMD method where Q_s(π^(k)) is linearly approximated by Φ_s w^(k) through compatible function approximation (14). Similarly, we can write the inexact NPG update rule as

π^(k+1)_s = argmin_{p∈∆(A)} { η_k ⟨Φ̄^(k)_s w^(k), p⟩ + D(p, π^(k)_s) },   ∀s ∈ S,    (17)

where w^(k) is an inexact solution for minimizing L_A(w, θ^(k), d̃^(k)) defined in (10), and Φ̄^(k)_s ∈ R^{|A|×m} is a matrix whose rows consist of the centered feature maps φ̄_{s,a}(θ^(k))^⊤, as defined in (12). Reformulating Q-NPG and NPG into the mirror descent forms (16) and (17), respectively, allows us to adapt the analysis of the PMD method developed in Xiao (2022) to obtain sharp convergence rates. In particular, we show that with an increasing step size η_k ∝ γ^{−k}, both NPG and Q-NPG with the log-linear policy parametrization converge linearly up to an error floor determined by the quality of the compatible function approximation.
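The mirror descent step above has the closed-form multiplicative update described in the text. A minimal sketch (ours; the function name is hypothetical, and the policy is stored as an explicit (S, A) table):

```python
import numpy as np

def q_npg_pmd_step(pi, Phi, w, eta):
    """One inexact PMD step: for every state s,
        pi_new_s = argmin_p  eta * <Phi_s w, p> + KL(p, pi_s),
    whose closed form is pi_new_{s,a} ∝ pi_{s,a} * exp(-eta * phi_{s,a}^T w).
    pi: (S, A) current policy, Phi: (S, A, m) features, w: (m,) regression sol."""
    scores = Phi @ w                               # (S, A) linear Q-approximation
    logits = np.log(pi) - eta * scores
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)        # the Z_s normalization
```

With η = 0 the policy is unchanged, and as η grows the update concentrates mass on the actions with the smallest approximated cost, consistent with the minimization convention used throughout.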

4. ANALYSIS OF Q-NPG WITH LOG-LINEAR POLICIES

In this section, we provide the convergence analysis of the following inexact Q-NPG method:

θ^(k+1) = θ^(k) − η_k w^(k),   w^(k) ≈ argmin_w L_Q(w, θ^(k), d̃^(k)),    (18)

where d̃^(k) is shorthand for d̃^{π(θ^(k))}(ν) and ν ∈ ∆(S × A) is an arbitrary state-action distribution that does not depend on ρ. The exact minimizer is denoted as w^(k)_* ∈ argmin_w L_Q(w, θ^(k), d̃^(k)). Following Agarwal et al. (2021), the compatible function approximation error can be decomposed as

L_Q(w^(k), θ^(k), d̃^(k)) = [ L_Q(w^(k), θ^(k), d̃^(k)) − L_Q(w^(k)_*, θ^(k), d̃^(k)) ] + L_Q(w^(k)_*, θ^(k), d̃^(k)),

where the first term is the statistical error (excess risk) and the second is the approximation error. The statistical error measures how accurately we solve the regression problem, i.e., how good w^(k) is compared with w^(k)_*. The approximation error measures the best possible error in approximating Q_{s,a}(θ^(k)) using φ_{s,a} as features in the regression problem (modeling error). One way to proceed with the analysis is to assume that both the statistical error and the approximation error are bounded for all iterations, which is the approach we take in Section 4.2 and also later in Section 5 for the analysis of the NPG method. However, in Section 4.1, we first take an alternative approach proposed by Agarwal et al. (2021), where the assumption of bounded approximation error is replaced by a bounded transfer error. The transfer error refers to L_Q(w^(k)_*, θ^(k), d̃*), where the iteration-dependent visitation distribution d̃^(k) is shifted to a fixed one d̃* (defined in Section 4.1). These two approaches require different additional assumptions and result in slightly different convergence rates. Here we first state the common assumption of bounded statistical error.

Assumption 1 (Bounded statistical error, Assumption 6.1.1 in Agarwal et al. (2021)). There exists ε_stat > 0 such that for all iterations k ≥ 0 of the Q-NPG method (18), we have

E[ L_Q(w^(k), θ^(k), d̃^(k)) − L_Q(w^(k)_*, θ^(k), d̃^(k)) ] ≤ ε_stat.    (19)

By solving the regression problem with sampling-based approaches, we can expect ε_stat = O(1/√T) (Agarwal et al., 2021) or ε_stat = O(1/T) (see Corollary 1), where T is the number of iterations used to find the inexact solution w^(k).

4.1. ANALYSIS WITH BOUNDED TRANSFER ERROR

Here we introduce some additional notation. For any state distributions p, q ∈ ∆(S), we define the distribution mismatch coefficient of p relative to q as ‖p/q‖_∞ := max_{s∈S} p_s/q_s. Let π* be an arbitrary comparator policy, which is not necessarily an optimal policy and does not need to belong to the log-linear policy class. Fix a state distribution ρ ∈ ∆(S). We denote d^{π*}(ρ) as d* and d^{π(θ^(k))}(ρ) as d^(k), and define the following distribution mismatch coefficients:

ϑ_k := ‖d*/d^(k)‖_∞ ≤ 1/(1−γ) ‖d*/ρ‖_∞   (by (5))   and   ϑ_ρ := 1/(1−γ) ‖d*/ρ‖_∞ ≥ 1/(1−γ).    (20)

Thus, for all k ≥ 0, we have ϑ_k ≤ ϑ_ρ. We assume that ϑ_ρ < ∞, which is the case, for example, if ρ_s > 0 for all s ∈ S. This assumption is commonly used in the literature on policy gradient methods (e.g., Zhang et al., 2020; Wang et al., 2020) and in NPG convergence analyses (e.g., Cayci et al., 2021; Xiao, 2022). We further relax this condition in Appendix H. Given a state distribution ρ and a comparator policy π*, we define a state-action measure d̃* as

d̃*_{s,a} := d*_s · Unif_A(a) = d*_s / |A|,    (21)

and use it to express the transfer error as L_Q(w^(k)_*, θ^(k), d̃*).

Assumption 2 (Bounded transfer error, Assumption 6.1.2 in Agarwal et al. (2021)). There exists ε_bias > 0 such that for all iterations k ≥ 0 of the Q-NPG method (18), we have

E[ L_Q(w^(k)_*, θ^(k), d̃*) ] ≤ ε_bias.    (22)

The quantity ε_bias is often referred to as the transfer error, since it is the error due to replacing the relevant distribution d̃^(k) by d̃*. This transfer error bound characterizes how well the Q-values can be linearly approximated by the feature maps φ_{s,a}. It can be shown that ε_bias = 0 when π^(k) is the softmax tabular policy (Agarwal et al., 2021) or the MDP has a certain low-rank structure (Jiang et al., 2017; Yang & Wang, 2019; Jin et al., 2020). For rich neural parametrizations, ε_bias can be made small (Wang et al., 2020).
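The quantities defined above are straightforward to compute for explicit distributions. A small sketch (ours; names are hypothetical) of the mismatch coefficients (20) and the uniform state-action measure (21):

```python
import numpy as np

def mismatch_coefficients(d_star, rho, gamma):
    """Distribution mismatch coefficient ||d*/rho||_inf and the bound
    theta_rho = ||d*/rho||_inf / (1 - gamma), assuming rho > 0 entrywise."""
    ratio_inf = np.max(d_star / rho)
    theta_rho = ratio_inf / (1.0 - gamma)
    return ratio_inf, theta_rho

def uniform_state_action(d_star, num_actions):
    """dtilde*_{s,a} = d*_s / |A|: spread each state's mass uniformly
    over the actions, giving a distribution on S x A."""
    return np.repeat(d_star[:, None] / num_actions, num_actions, axis=1)
```

Note that ϑ_ρ ≥ 1/(1−γ) always holds, since ‖d*/ρ‖_∞ ≥ 1 for any pair of distributions.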
The next assumption concerns the relative condition number between two covariance matrices of φ_{s,a} defined under different state-action distributions.

Assumption 3 (Bounded relative condition number, Assumption 6.2 in Agarwal et al. (2021)). Fix a state distribution ρ, a state-action distribution ν and a comparator policy π*. Let

Σ_{d̃*} := E_{(s,a)∼d̃*} [ φ_{s,a} φ_{s,a}^⊤ ],   Σ_ν := E_{(s,a)∼ν} [ φ_{s,a} φ_{s,a}^⊤ ],    (23)

where d̃* is specified in (21). We define the relative condition number between Σ_{d̃*} and Σ_ν as

κ_ν := max_{w∈R^m} (w^⊤ Σ_{d̃*} w) / (w^⊤ Σ_ν w),    (24)

and assume that κ_ν is finite. The quantity κ_ν is referred to as the relative condition number, since it is a ratio between two different matrix-induced norms. Assumption 3 benefits from the use of ν. In fact, it is shown in Agarwal et al. (2021, Remark 22 and Lemma 23) that κ_ν can be made reasonably small (e.g., κ_ν ≤ m is always possible) and independent of the size of the state space by controlling ν. Our analysis also needs the following assumption, which does not appear in Agarwal et al. (2021).

Assumption 4 (Concentrability coefficient for state visitation). There exists a finite C_ρ > 0 such that for all iterations k ≥ 0 of the Q-NPG method (18), it holds that

E_{s∼d*} [ (d^(k)_s / d*_s)² ] ≤ C_ρ.    (25)

Such concentrability coefficients are studied in the analysis of approximate dynamic programming algorithms (Munos, 2003; 2005; Munos & Szepesvári, 2008). This one measures how far the iteration-dependent distribution d^(k) can deviate from the reference distribution d*. Let ρ_min = min_{s∈S} ρ_s. A sufficient condition for Assumption 4 to hold is that ρ_min > 0. Indeed,

E_{s∼d*} [ (d^(k)_s / d*_s)² ] ≤ ‖d^(k)/d*‖_∞ ≤ 1/(1−γ) ‖d^(k)/ρ‖_∞ ≤ 1/((1−γ) ρ_min),    (26)

where the second inequality uses (5). In practice, C_ρ can be much smaller than the pessimistic bound shown above. This is especially the case if we choose π* to be the optimal policy and d^(k) → d*. In Section 4.2, we further replace C_ρ by a coefficient C_ν that is independent of ρ and thus more easily bounded.
Now we present our first main result.

Theorem 1. Fix a state distribution ρ, a state-action distribution ν and a comparator policy π*. Consider the Q-NPG method (18) with the uniform initial policy and step sizes satisfying η_0 ≥ ((1−γ)/γ) log|A| and η_{k+1} ≥ η_k/γ. Suppose that Assumptions 1, 2, 3 and 4 all hold. Then for all k ≥ 0,

E[V_ρ(π^(k))] − V_ρ(π*) ≤ (1 − 1/ϑ_ρ)^k · 2/(1−γ) + (2√|A| (ϑ_ρ √C_ρ + 1))/(1−γ) ( √(κ_ν ε_stat/(1−γ)) + √ε_bias ).

The main differences between our Theorem 1 and Theorem 20 of Agarwal et al. (2021), their corresponding result on the inexact Q-NPG method, are summarized as follows.
• The convergence rate of Agarwal et al. (2021, Theorem 20) is O(1/√k) up to an error floor determined by ε_stat and ε_bias. We obtain linear convergence up to an error floor that also depends on ε_stat and ε_bias. However, the magnitude of our error floor is worse (larger) by a factor of ϑ_ρ C_ρ, due to the concentrability and distribution mismatch coefficients used in our proof. A very pessimistic bound on this factor is as large as |S|²/(1−γ)².
• In terms of required conditions, both results use Assumptions 1, 2 and 3. Agarwal et al. (2021, Theorem 20) further assume that the norms of the feature maps φ_{s,a} are uniformly bounded and that w^(k) has a bounded norm (e.g., obtained by projected stochastic gradient descent). Due to the different analysis techniques described next, we avoid such boundedness assumptions but rely on the concentrability coefficient C_ρ defined in Assumption 4.
• Agarwal et al. (2021, Theorem 20) use a diminishing step size η ∝ 1/√k, where k is the total number of iterations, but we use a geometrically increasing step size η_k ∝ γ^{−k} for all k ≥ 0. This discrepancy reflects the different analysis techniques adopted. The key analysis tool in Agarwal et al.
(2021) is an NPG Regret Lemma (their Lemma 34), which relies on the smoothness of the functions log π_{s,a}(θ) (and thus the boundedness of φ_{s,a}) and the boundedness of w^(k), leading to the classical O(1/√k) rate with diminishing step sizes from the optimization literature. Our analysis exploits the three-point descent lemma (Chen & Teboulle, 1993) and the performance difference lemma (Kakade & Langford, 2002), without relying on smoothness parameters. As a consequence, we can take advantage of exponentially growing step sizes and avoid assuming the boundedness of φ_{s,a} or w^(k). The use of increasing step sizes can be interpreted as approximate policy iteration; see Appendix H for the connection with policy iteration. As a byproduct, we also obtain a sublinear O(1/k) convergence result with an arbitrary constant step size.

Theorem 2. Fix a state distribution ρ, a state-action distribution ν and an optimal policy π*. Consider the Q-NPG method (18) with the uniform initial policy and any constant step size η_k = η > 0. Suppose that Assumptions 1, 2, 3 and 4 all hold. Then for all k ≥ 0,

(1/k) ∑_{t=0}^{k−1} E[V_ρ(π^(t))] − V_ρ(π*) ≤ 1/((1−γ)k) ( log|A|/η + 2ϑ_ρ ) + (2√|A| (ϑ_ρ √C_ρ + 1))/(1−γ) ( √(κ_ν ε_stat/(1−γ)) + √ε_bias ).
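The step-size condition in these theorems is fully explicit and non-adaptive. A minimal sketch of the schedule (ours; the function name is hypothetical), taking both conditions with equality:

```python
import math

def q_npg_step_sizes(gamma, num_actions, K):
    """Geometrically increasing step sizes satisfying the theorem conditions
    eta_0 >= (1 - gamma)/gamma * log|A|  and  eta_{k+1} >= eta_k / gamma,
    here taken with equality, so eta_k = eta_0 / gamma^k."""
    eta0 = (1.0 - gamma) / gamma * math.log(num_actions)
    return [eta0 / gamma ** k for k in range(K)]
```

Since γ < 1, the schedule grows exponentially; no problem-dependent constant beyond γ and |A| is needed, which is what makes the schedule non-adaptive.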

4.2. ANALYSIS WITH BOUNDED APPROXIMATION ERROR

In this section, instead of assuming bounded transfer error, we provide a convergence analysis based on the usual notion of approximation error and a weaker concentrability coefficient.

Assumption 5 (Bounded approximation error). There exists ε_approx > 0 such that for all iterations k ≥ 0 of the Q-NPG method (18), it holds that

E[ L_Q(w^(k)_*, θ^(k), d̃^(k)) ] ≤ ε_approx.    (27)

As mentioned in Agarwal et al. (2021), Assumption 5 is stronger than Assumption 2 (bounded transfer error). Indeed,

L_Q(w^(k)_*, θ^(k), d̃*) ≤ ‖d̃*/d̃^(k)‖_∞ L_Q(w^(k)_*, θ^(k), d̃^(k)) ≤ 1/(1−γ) ‖d̃*/ν‖_∞ L_Q(w^(k)_*, θ^(k), d̃^(k)),

where the second inequality uses (5).

Assumption 6 (Concentrability coefficient for state-action visitation). There exists C_ν < ∞ such that for all iterations of the Q-NPG method (18), we have

E_{(s,a)∼d̃^(k)} [ (h^(k)_{s,a} / d̃^(k)_{s,a})² ] ≤ C_ν,    (29)

where h^(k)_{s,a} represents all of the following quantities: d^(k+1)_s π^(k+1)_{s,a}, d^(k+1)_s π^(k)_{s,a}, d*_s π^(k)_{s,a}, d*_s π*_{s,a}. Since we are free to choose ν independently of ρ, we can choose ν_{s,a} > 0 for all (s, a) ∈ S × A so that Assumption 6 holds. Indeed, with ν_min denoting min_{(s,a)∈S×A} ν_{s,a}, we have

E_{(s,a)∼d̃^(k)} [ (h^(k)_{s,a} / d̃^(k)_{s,a})² ] ≤ max_{(s,a)∈S×A} h^(k)_{s,a} / d̃^(k)_{s,a} ≤ 1/((1−γ) ν_min),    (30)

where the upper bound can be smaller than that in (26) if ρ_min is smaller than ν_min.

Theorem 3. Fix a state distribution ρ, a state-action distribution ν and a comparator policy π*. Consider the Q-NPG method (18) with the uniform initial policy and step sizes satisfying η_0 ≥ ((1−γ)/γ) log|A| and η_{k+1} ≥ η_k/γ. Suppose that Assumptions 1, 5 and 6 hold. Then for all k ≥ 0,

E[V_ρ(π^(k))] − V_ρ(π*) ≤ (1 − 1/ϑ_ρ)^k · 2/(1−γ) + (2√C_ν (ϑ_ρ + 1))/(1−γ) ( √ε_stat + √ε_approx ).

Compared to Theorem 1, while the approximation error assumption is stronger than the transfer error assumption, we do not require the assumption on the relative condition number κ_ν, and the error floor depends neither on κ_ν nor explicitly on |A|.
Besides, we can always choose ν so that C_ν is finite even if C_ρ is unbounded. However, it is not clear whether Theorem 3 is better than Theorem 1.

Remark 1. Note that Theorems 1, 2 and 3 benefit from using the visitation distribution d̃^(k) instead of d^(k) (i.e., from using ν instead of ρ). In particular, from (5), d̃^(k) has a lower bound that is independent of the policy π^(k) and of ρ. This property allows us to define a weak notion of relative condition number (Assumption 3) that is independent of the iterates, and also allows us to obtain a finite upper bound on C_ν (Assumption 6 and (30)) that is independent of ρ.

4.3. SAMPLE COMPLEXITY OF Q-NPG

Here we establish the sample complexity results, i.e., the total number of single-step interactions with the environment, of a sample-based Q-NPG method (Algorithm 2 in Appendix C). Combined with a simple SGD solver, Q-NPG-SGD in Algorithm 6, the following corollary shows that Algorithm 2 converges globally under the further assumptions that the feature map is bounded and has a non-singular covariance matrix (31). The explicit constants of the bound can be found in the appendix.

Corollary 1. Consider the setting of Theorem 3. Suppose that the sample-based Q-NPG Algorithm 2 is run for K iterations, with T gradient steps of Q-NPG-SGD (Algorithm 6) per iteration. Furthermore, suppose that ‖φ_{s,a}‖ ≤ B with B > 0 for all (s, a) ∈ S × A, and that we choose the step size α = 1/(2B²) and the initialization w_0 = 0 for Q-NPG-SGD. If for all θ ∈ R^m, the covariance matrix of the feature map under the initial state-action distribution ν satisfies

E_{(s,a)∼ν} [ φ_{s,a} φ_{s,a}^⊤ ] = Σ_ν ⪰ μ I_m,    (31)

where I_m ∈ R^{m×m} is the identity matrix and μ > 0, then

E[V_ρ(π^(K))] − V_ρ(π*) ≤ (1 − 1/ϑ_ρ)^K · 2/(1−γ) + O( √ε_approx/(1−γ) ) + O( √m / (√((1−γ)³) √T) ).

In Q-NPG-SGD, each trajectory has expected length 1/(1−γ) (see Lemma 4). Consequently, with K = O(log(1/ε) log(1/(1−γ))) and T = O(1/((1−γ)^6 ε²)), Q-NPG requires K·T/(1−γ) = Õ(1/((1−γ)^7 ε²)) samples such that E[V_ρ(π^(K))] − V_ρ(π*) ≤ O(ε) + O(√ε_approx/(1−γ)). The Õ(1/ε²) sample complexity matches that of value-based algorithms such as Q-learning (Li et al., 2020). Compared to Agarwal et al. (2021, Corollary 26) for the sample-based Q-NPG Algorithm 2, their sample complexity is O(1/((1−γ)^{11} ε^6)) with K = 1/((1−γ)² ε²) and T = 1/((1−γ)^8 ε^4). Besides the improvement in the convergence rate with respect to K, they use the optimization results of Shalev-Shwartz & Ben-David (2014, Theorem 14.8) to obtain ε_stat = O(1/√T), while we use those of Bach & Moulines (2013, Theorem 1) (see also Theorem 12) to establish the faster rate ε_stat = O(1/T).
Besides, they use the projected SGD method and require the stochastic gradient to be bounded, a condition that is incorrectly verified in their proof. In contrast, to apply Theorem 12, boundedness of the stochastic gradient is not necessary; instead, we require the different condition (31). A proof sketch of our corollary and a discussion of condition (31) are provided in Appendix D.5.
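To illustrate the inner regression loop, here is a minimal averaged-SGD stand-in (our own sketch, not the paper's Algorithm 6) for min_w E[(w^⊤φ − Q̂)²] with the step size α = 1/(2B²), initialization w_0 = 0, and the iterate averaging used by Bach & Moulines (2013); the sampling interface and function name are assumptions.

```python
import numpy as np

def sgd_regression(samples, m, B, T, rng):
    """Plain averaged SGD on the least-squares objective
    E[(w^T phi - Qhat)^2], drawing one (phi, Qhat) pair per step.
    samples: list of (phi, Qhat) with ||phi|| <= B; returns the running
    average of the iterates, whose excess risk decays as O(1/T) in the
    unprojected setting of Bach & Moulines (2013)."""
    alpha = 1.0 / (2.0 * B * B)            # step size from Corollary 1
    w = np.zeros(m)                         # initialization w_0 = 0
    w_bar = np.zeros(m)
    for t in range(T):
        phi, q = samples[rng.integers(len(samples))]
        w = w - alpha * 2.0 * (phi @ w - q) * phi   # stochastic gradient step
        w_bar += (w - w_bar) / (t + 1)              # online running average
    return w_bar
```

Unlike projected SGD, this scheme needs no bound on the stochastic gradient, mirroring the discussion above; it instead relies on the covariance condition (31) for the O(1/T) rate.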

5. ANALYSIS OF NPG WITH LOG-LINEAR POLICIES

We now return to the convergence analysis of the inexact NPG method, specifically,

θ^(k+1) = θ^(k) − η_k w^(k),   w^(k) ≈ argmin_w L_A(w, θ^(k), d̃^(k)).    (32)

Again, let w^(k)_* ∈ argmin_w L_A(w, θ^(k), d̃^(k)) denote the exact minimizer. Our analysis of NPG is analogous to that of Q-NPG in the previous section. That is, we again exploit the inexact PMD formulation (17) and use techniques developed in Xiao (2022). The set of assumptions we use for NPG is analogous to those used in Section 4.2. In particular, we assume a bounded approximation error instead of a bounded transfer error (cf. Assumption 2) in minimizing L_A, and we do not need the assumption on the relative condition number.

Assumption 7 (Bounded statistical error, Assumption 6.5.1 in Agarwal et al. (2021)). There exists ε_stat > 0 such that for all iterations k ≥ 0 of the NPG method (32), we have

E[ L_A(w^(k), θ^(k), d̃^(k)) − L_A(w^(k)_*, θ^(k), d̃^(k)) ] ≤ ε_stat.    (33)

Assumption 8 (Bounded approximation error). There exists ε_approx > 0 such that for all iterations k ≥ 0 of the NPG method (32), we have

E[ L_A(w^(k)_*, θ^(k), d̃^(k)) ] ≤ ε_approx.    (34)

Assumption 9 (Concentrability coefficient for state-action visitation). There exists C_ν < ∞ such that for all iterations k ≥ 0 of the NPG method (32), we have

E_{(s,a)∼d̃^(k)} [ (d^(k+1)_{s,a} / d̃^(k)_{s,a})² ] ≤ C_ν   and   E_{(s,a)∼d̃^(k)} [ (d^{π*}_{s,a} / d̃^(k)_{s,a})² ] ≤ C_ν.    (35)

Under the above assumptions, we have the following result.

Theorem 4. Fix a state distribution ρ, a state-action distribution ν, and a comparator policy π*. Consider the NPG method (32) with the uniform initial policy and step sizes satisfying η_0 ≥ ((1−γ)/γ) log|A| and η_{k+1} ≥ η_k/γ. Suppose that Assumptions 7, 8 and 9 hold. Then for all k ≥ 0,

E[V_ρ(π^(k))] − V_ρ(π*) ≤ (1 − 1/ϑ_ρ)^k · 2/(1−γ) + (√C_ν (ϑ_ρ + 1))/(1−γ) ( √ε_stat + √ε_approx ).

Now we compare Theorem 4 with Theorem 29 in Agarwal et al. (2021) for the NPG analysis.
The main differences are similar to those for Q-NPG summarized after Theorem 1: their convergence rate is sublinear while ours is linear; they assume uniformly bounded φ_{s,a} and w^(k) while we require a bounded concentrability coefficient C_ν due to different proof techniques; they use diminishing step sizes and we use geometrically increasing ones. Moreover, Theorem 4 requires a bounded approximation error, which is a stronger assumption than the bounded transfer error used in their Theorem 29, but we do not need the assumption of a bounded relative condition number. In particular, their bounded relative condition number must hold for the covariance matrix of φ̄_{s,a}(θ^(k)) for all k ≥ 0, which depends on the iterates θ^(k). This is in contrast to our Assumption 3, where we use a single fixed covariance matrix for Q-NPG that is independent of the iterates, as defined in (23). In addition, the inequalities in (35) only involve half of the state-action visitation distributions listed in (29), i.e., the first and the fourth terms. Thus, C_ν in (35) shares the same upper bound (30), regardless of whether the algorithm is Q-NPG or NPG. Consequently, our concentrability coefficient C_ν in (35) is weaker than Assumption 2 in Cayci et al. (2021), which studies the linear convergence of NPG with entropy regularization for the log-linear policy class. The reason is that the bound on C_ν in (30) does not depend on the policies encountered throughout the iterations, thanks to the use of d̃^(k) instead of d^(k) (see Remark 1 as well). See Appendix H for a thorough discussion of the concentrability coefficient C_ν. As a byproduct, we also obtain a sublinear rate for NPG with an arbitrary constant step size in Theorem 9 in Appendix F.
By further assuming that the feature map is bounded and the Fisher information matrix (9) is non-singular, we also provide an Õ(1/ε²) sample complexity result for NPG, as for Q-NPG, and fix the errors that occurred in the NPG sample complexity proofs of Agarwal et al. (2021) and Liu et al. (2020) in Appendix G. Below we provide the related-work discussion, the proofs missing from the main paper, and some additional noteworthy observations made in the main paper.

A RELATED WORK

We provide an extended discussion of the context of our work, including a discussion comparing the technical novelty of the paper to the analysis of NPG with softmax tabular policies in Xiao (2022), and a comparison of the convergence theories of NPG in the literature. Furthermore, we discuss future work, such as extending our analysis to other RL settings.

A.1 TECHNICAL CONTRIBUTION AND NOVELTY COMPARED TO XIAO (2022)

Our technical novelty compared to Xiao (2022) is summarized as follows.

• Our linear convergence results (i.e., Theorems 1, 3 and 4) are not direct applications of Theorem 10 in Xiao (2022). Indeed, Xiao (2022) establishes the connection between NPG and a specific form of policy mirror descent (PMD), with the use of the weighted Bregman divergence, for the tabular setting, while we show that this connection can also be established in the function approximation setting via the compatible function approximation framework (10). We also modify the PMD framework of Xiao (2022) with the linear approximation of the advantage function in (17), inspired by the compatible function approximation framework. Thus, the approaches to deriving the PMD-form update are different. Without this work of using the compatible function approximation framework to bridge NPG and PMD, it was not at all clear that the analysis of Xiao (2022) could be extended to the log-linear policy setting. So our work is the first step in showing that the proof techniques used in Xiao (2022) can be extended to the function approximation regime. In fact, the extension is highly nontrivial and requires significant innovation (see details below). As for future work, one can extend our work to other function approximation settings through a similar compatible function approximation framework. See Appendix A.3 for more details about future work.

• Besides, our linear convergence results only consider the inexact NPG update.
Compared to Theorem 14 in Xiao (2022), their corresponding result on the inexact PMD method, we improve their analysis by making much weaker assumptions on the accuracy of the estimate Q̂(π). Xiao (2022) requires an L_∞ supremum-norm bound on the estimation error of Q, i.e., ||Q̂(π) − Q(π)||_∞ ≤ ε_stat, whereas our convergence guarantee depends on the expected L_2 error of the estimate, i.e., Assumptions 1 and 7. For instance, Assumption 1, through equation (62), can be written as E[(φ_{s,a}^⊤ w^{(k)} − φ_{s,a}^⊤ w_⋆^{(k)})²] ≤ ε_stat, which can be interpreted as E[(Q̂(π) − Q(π))²] ≤ ε_stat in the linear approximation setting. The techniques for handling L_∞ and L_2 errors are very different. Not only is our assumption weaker, it also benefits the sample complexity analysis that we explain next.

• Consequently, when considering the sample complexity results we derive for sample-based (Q-)NPG in Corollaries 1 and 3, the difference from Theorem 16 in Xiao (2022), their corresponding sample complexity result, is even more significant. Corollary 1 with Algorithm Q-NPG-SGD (Algorithm 6) satisfies Assumption 1 with a number of samples that depends only on the feature dimension m of φ and not on the cardinality of the state space |S| or the action space |A|. In contrast, the assumption of Xiao (2022, Theorem 16) causes the sample complexity to depend on |S||A|. Furthermore, Xiao (2022) uses a Monte-Carlo approach with multiple independent rollouts per iteration, while our sample-based (Q-)NPG uses one single rollout (Algorithms 3 and 4) combined with regression solvers; Xiao (2022) derives a high-probability sample complexity result, while we derive the convergence of the optimality gap E[V_ρ(π^{(K)}) − V_ρ(π*)], which guarantees that the variance of V_ρ(π^{(K)}) converges to zero. Thus, our sample-based algorithms had not been considered in Xiao (2022), and our proofs of Corollaries 1 and 3 require a different approach.
In particular, our sample complexity analysis of the policy evaluation step is novel. Although our sample-based algorithms had been considered previously in Agarwal et al. (2021) and Liu et al. (2020), neither of their sample complexity analyses was correct. Indeed, Agarwal et al. (2021) required boundedness of the stochastic gradient estimator, which might not hold, as we discuss extensively in Appendix D.5. We fixed this by showing that E[Q̂_{s,a}(θ)²] is bounded. See Appendix D.5 for all the subtleties, including a proof sketch of Corollary 1. Liu et al. (2020) also incorrectly used an inequality in which the random variables are correlated. See the detailed explanation (Footnote 7) in Appendix G.1. We fixed this error with a careful conditional-expectation argument. Please refer to Appendix G.1 for all the details, including a proof sketch of Corollary 3. These are the dimensions along which an important part of the technical work was done. Therefore, outside of the tabular setting, and considering NPG methods that make use of a regression solver, our complexity analysis is, to the best of our knowledge, currently the only one that is entirely correct.

• Finally, we not only extend the work of Xiao (2022) to NPG for log-linear policies, but also consider the Q-NPG method and establish its linear convergence analysis. This method is specific to the log-linear policy class and again had not been considered in Xiao (2022).

A.2 FINITE-TIME ANALYSIS OF THE NATURAL POLICY GRADIENT

NPG for softmax tabular policies. For softmax tabular policies, Shani et al. (2020) show that unregularized NPG has an O(1/√k) convergence rate and that regularized NPG has a faster O(1/k) rate when using a decaying step size. Agarwal et al. (2021) improve the convergence rate of unregularized NPG to O(1/k) with constant step sizes. Further, Khodadadian et al. (2021a) also achieve an O(1/k) convergence rate for the off-policy natural actor-critic (NAC), and a slower sublinear result is established by Khodadadian et al. (2022a) for the two-time-scale NAC. By using entropy regularization, Cen et al. (2021) achieve a linear convergence rate for NPG. A similar linear convergence result has been obtained by rewriting the NPG update under the PMD framework with the Kullback-Leibler (KL) divergence (Lan, 2022) or with a more general convex regularizer (Zhan et al., 2021). Such an approach is also applied in the average-reward MDP setting to achieve linear convergence for NPG (Li et al., 2022a). However, adding regularization might bias the solution. Thus, Lan (2022) considers exponentially diminishing regularization to guarantee an unbiased solution. Furthermore, by considering both the KL divergence and diminishing entropy regularization, Li et al. (2022b) establish a linear convergence rate not only for the optimality gap but also for the policy; that is, the policy converges to the fixed high-entropy optimal policy. Consequently, Li et al. (2022b) show local super-linear convergence of both the policy and the optimality gap, as discussed in Xiao (2022, Section 4.3). Recently, Bhandari & Russo (2021), Khodadadian et al. (2021b; 2022b) and Xiao (2022) have shown that regularization is unnecessary for obtaining linear convergence: it suffices to use appropriate step sizes for NPG.
In particular, Bhandari & Russo (2021) propose to use an exact line search for the step size (Theorem 1 (a)) or to choose an adaptive step size (Theorem 1 (c)). A similar adaptive step size is proposed by Khodadadian et al. (2021b; 2022b). Notice that such an adaptive step size requires complete knowledge of the environment model. Instead, a sufficiently large step size might be enough. In this paper, we extend the results of Xiao (2022) from the tabular setting to log-linear policies, using a non-adaptive, geometrically increasing step size and obtaining a linear convergence rate for NPG without regularization.

NPG with function approximation. In the function approximation regime, there have been many works investigating the convergence rates of NPG or NAC algorithms from different perspectives (e.g., Wang et al., 2020). In contrast, we show that by using a simple geometrically increasing step size, fast linear convergence can be achieved for log-linear policies without any additional regularization or projection step. We notice that Chen & Theja Maguluri (2022, Theorem 3.4) also use an increasing step size and achieve linear convergence for log-linear policies without regularization. The main differences between our result and Theorem 3.4 in Chen & Theja Maguluri (2022) are fourfold. First, they rely on the contraction property of the generalized Bellman operator, while we consider the PMD analysis approach, so the proof techniques are completely different. Second, their parameter update results in off-policy multi-step temporal difference learning, whereas we solve a linear regression problem to minimize the function approximation error. Third, their step size still depends on the iterates (it is thus adaptive) and is proportional to the total number of iterations K, while ours is independent of both the iterates and K.
Finally, their assumption on the modeling error requires an L_∞ supremum-norm bound, i.e., ||Q_s(θ^{(k)}) − Φ w^{(k)}||_∞ ≤ ε_bias for all states s of the state space, whereas our convergence guarantee depends on an expected error (e.g., Assumption 2, 5 or 8), which is a much weaker assumption. After our submission, we became aware of the concurrent work of Alfano & Rebeschini (2022). They only analyze the Q-NPG method and achieve linear convergence results similar to our Theorem 1. In particular, their Theorem 4.7 has a better concentrability coefficient than our Theorem 1. However, their Assumption 4.6 assumes that the relative condition number upper bounds a time-varying ratio which depends on the iterates, while our Assumption 3 is independent of the iterates, as defined in (24). Furthermore, they only consider the case where the initial state distribution coincides with the target state distribution, while our analysis generalizes to any target state distribution; the distribution mismatch coefficients are discussed extensively in Appendix H. See Table 1 for a complete overview of NPG in the function approximation regime.

Fast linear convergence of other policy gradient methods. Apart from the PMD analysis approach, by leveraging a gradient dominance property (Polyak, 1963; Łojasiewicz, 1963), fast linear convergence results have also been established for PG methods in different settings, such as linear quadratic control problems (Fazel et al., 2018) and the exact PG method with softmax tabular policies and entropy regularization (Mei et al., 2020; Yuan et al., 2022). Such a gradient domination property is explored extensively by Bhandari & Russo (2019) to identify more general structural MDP settings. Linear convergence of PG can also be obtained through exact line search (Bhandari & Russo, 2021, Theorem 1 (a)) or by exploiting non-uniform smoothness (Mei et al., 2021).
Alternatively, by considering a general strongly concave utility function of the state-action occupancy measure and exploiting the hidden convexity of the problem, Zhang et al. (2020) also achieve linear convergence of a variational PG method. When the objective is relaxed to a general concave utility function, Zhang et al. (2021) still achieve linear convergence by leveraging the hidden convexity of the problem and adding variance reduction to the PG method.

A.3 FUTURE WORK

The main focus of this paper is the theoretical analysis of the NPG method. The results we have obtained open up several experimental questions related to parameter settings for NPG and Q-NPG. We leave such questions as important future work to further support our theoretical findings. An interesting application of our work is to investigate the sample complexity of natural actor-critic with our PMD analysis. Indeed, our paper obtains w^{(k)} by a regression solver. One can also use temporal difference (TD) learning (e.g., Cayci et al. (2021); Chen & Theja Maguluri (2022); Telgarsky (2022)) with Markovian sampling to achieve a similar Õ(1/ε²) sample complexity result. The performance analysis of TD learning yields a bound on ε_stat, which directly implies total sample complexity results through our theorems. One natural question is whether we can extend our analysis to general policy classes. Here we sketch one possible way, based on a similar compatible function approximation framework. Concretely, consider the parametrized policy

π_{s,a}(θ) = exp(f_{s,a}(θ)) / Σ_{a'∈A} exp(f_{s,a'}(θ)),

where f_{s,a}(θ) is parametrized by θ ∈ R^m and is differentiable. As Agarwal et al. (2021) mention, the gradient can be written as ∇_θ log π_{s,a}(θ) = g_{s,a}(θ), where

g_{s,a}(θ) = ∇_θ f_{s,a}(θ) − E_{a'∼π_s(θ)}[∇_θ f_{s,a'}(θ)].

The NPG update is then equivalent to the following compatible function approximation framework:

θ^{(k+1)} = θ^{(k)} − η_k w^{(k)}, with w^{(k)} ∈ argmin_w E_{(s,a)∼d^{(k)}}[(A_{s,a}(θ^{(k)}) − w^⊤ g_{s,a}(θ^{(k)}))²].
As Alfano & Rebeschini (2022, Remark 4.8) mention, if we assume that, for all (s, a) ∈ S × A, the function f(θ) satisfies

f_{s,a}(θ^{(k+1)}) = f_{s,a}(θ^{(k)}) − η_k (w^{(k)})^⊤ g_{s,a}(θ^{(k)}),

which is the case for log-linear policies, then one can easily verify that the policy resulting from the NPG update is also equivalent to the policy mirror descent update

π^{(k+1)}_s = argmin_{p∈Δ(A)} { η_k ⟨G^{(k)}_s w^{(k)}, p⟩ + D(p, π^{(k)}_s) }, for all s ∈ S,

where G^{(k)}_s ∈ R^{|A|×m} is the matrix with rows (g_{s,a}(θ^{(k)}))^⊤ ∈ R^{1×m} for a ∈ A. Consequently, one can naturally extend our work to this general setting to derive a linear convergence analysis for NPG. Perhaps one can also consider exponential tilting, a generalization of softmax to more general probability distributions. Another interesting avenue of investigation is to consider a generalized linear model instead of linear function approximation for the Q-function and the advantage function. One interesting open question is whether there is a way to increase the step size when the discount factor is unknown. So far, the PMD proof techniques used in Lan (2022), Xiao (2022) and our work require that the discount factor is known. Perhaps the work of Li et al. (2022a) can help find a way to increase the step size when the discount factor is unknown. Indeed, Li et al. (2022a) consider the average-reward MDP setting, so there is no discount factor, and they achieve linear convergence for NPG by increasing the step size with some regularization parameters. It would be interesting to investigate whether their way of increasing the step size can be applied in our setting.

B STANDARD REINFORCEMENT LEARNING RESULTS

In this section, we prove the standard reinforcement learning results used in our main paper, including the NPG updates written through the compatible function approximation (11) and the NPG updates formalized as policy mirror descent ((16) and (17)). Then, we prove the performance difference lemma (Kakade & Langford, 2002), which is the first key ingredient of our PMD analysis. The three-point descent lemma (Lemma 11) is the second key ingredient of our PMD analysis.

Lemma 1 (NPG updates via compatible function approximation; Theorem 1 in Kakade (2001)). Consider the NPG updates (8)

θ^{(k+1)} = θ^{(k)} − η_k F_ρ(θ^{(k)})† ∇_θ V_ρ(θ^{(k)}),

and the updates using the compatible function approximation (11)

θ^{(k+1)} = θ^{(k)} − η_k w^{(k)}, where w^{(k)} ∈ argmin_{w∈R^m} L_A(w, θ^{(k)}, d^{(k)}).

If the parametrized policy is differentiable for all θ ∈ R^m, then the two updates are equivalent up to a constant scaling (1−γ) of η_k.

Proof. Indeed, using the policy gradient (7) and the fact that Σ_{a∈A} ∇π_{s,a}(θ) = 0 for all s ∈ S (since π(θ) is differentiable in θ and Σ_{a∈A} π_{s,a} = 1), we have the policy gradient theorem (Sutton et al., 2000):

∇_θ V_ρ(θ) = 1/(1−γ) · E_{s∼d^θ, a∼π_s(θ)}[A_{s,a}(θ) ∇_θ log π_{s,a}(θ)].

Furthermore, consider the minimizer w^{(k)}. By the first-order optimality condition, we have

∇_w L_A(w^{(k)}, θ^{(k)}, d^{(k)}) = 0
⇔ E_{(s,a)∼d^{(k)}}[((w^{(k)})^⊤ ∇_θ log π^{(k)}_{s,a} − A_{s,a}(θ^{(k)})) ∇_θ log π^{(k)}_{s,a}] = 0
⇔ E_{(s,a)∼d^{(k)}}[∇_θ log π^{(k)}_{s,a} (∇_θ log π^{(k)}_{s,a})^⊤] w^{(k)} = E_{(s,a)∼d^{(k)}}[A_{s,a}(θ^{(k)}) ∇_θ log π^{(k)}_{s,a}] (36)
⇔ F_ρ(θ^{(k)}) w^{(k)} = (1−γ) ∇_θ V_ρ(θ^{(k)}).

Thus, we have w^{(k)} = (1−γ) F_ρ(θ^{(k)})† ∇_θ V_ρ(θ^{(k)}), which yields the update (8) up to a constant scaling (1−γ) of η_k.

Lemma 2 (NPG updates as policy mirror descent).
The closed-form solution to (16) is given by

π^{(k+1)}_s = (π^{(k)}_s ⊙ exp(−η_k Φ_s w^{(k)})) / (Σ_{a∈A} π^{(k)}_{s,a} exp(−η_k φ_{s,a}^⊤ w^{(k)})) (37)
= (π^{(k)}_s ⊙ exp(−η_k Φ̄^{(k)}_s w^{(k)})) / (Σ_{a∈A} π^{(k)}_{s,a} exp(−η_k φ̄_{s,a}(θ^{(k)})^⊤ w^{(k)})) (38)
= argmin_{p∈Δ(A)} { η_k ⟨Φ̄^{(k)}_s w^{(k)}, p⟩ + D(p, π^{(k)}_s) }, for all s ∈ S,

where ⊙ is the element-wise product between vectors, and Φ̄^{(k)}_s ∈ R^{|A|×m} is defined in (17), i.e., its rows are

φ̄_{s,a}(θ^{(k)}) (12)= φ_{s,a} − E_{a'∼π^{(k)}_s}[φ_{s,a'}].

Such a policy update coincides with the inexact NPG updates (32) of the log-linear policy if θ^{(k+1)} = θ^{(k)} − η_k w^{(k)} with w^{(k)} ≈ argmin_w L_A(w, θ^{(k)}, d^{(k)}), and coincides with the inexact Q-NPG updates (18) of the log-linear policy if θ^{(k+1)} = θ^{(k)} − η_k w^{(k)} with w^{(k)} ≈ argmin_w L_Q(w, θ^{(k)}, d^{(k)}).

Proof. For shorthand, let g = Φ_s w^{(k)}. Thus, (16) fits the format of Lemma 10 in Appendix I with q = π^{(k)}_s. Consequently, the closed-form solution is given by (99), that is,

π^{(k+1)}_s = (π^{(k)}_s ⊙ e^{−η_k g}) / (Σ_{a∈A} π^{(k)}_{s,a} e^{−η_k g_a}) = (π^{(k)}_s ⊙ e^{−η_k Φ_s w^{(k)}}) / (Σ_{a∈A} π^{(k)}_{s,a} e^{−η_k φ_{s,a}^⊤ w^{(k)}}) = (π^{(k)}_s ⊙ exp(−η_k Φ̄_s(θ^{(k)}) w^{(k)})) / (Σ_{a∈A} π^{(k)}_{s,a} exp(−η_k φ̄_{s,a}(θ^{(k)})^⊤ w^{(k)})),

where the last equality is obtained since φ̄_{s,a}(θ^{(k)}) = φ_{s,a} − E_{a'∼π^{(k)}_s}[φ_{s,a'}] = φ_{s,a} − c_s, with c_s ∈ R^m a constant vector independent of a. Similarly, by applying Lemma 10 with g = Φ̄^{(k)}_s w^{(k)}, the closed-form solution to (39) is (40). As for the closed-form updates of the policy for NPG (32) and Q-NPG (18) with the parameter updates θ^{(k+1)} = θ^{(k)} − η_k w^{(k)}, it is straightforward to verify that they coincide with (37) and (38), given the specific structure of the log-linear policy (6), which concludes the proof.

Lemma 3 (Performance difference lemma (Kakade & Langford, 2002)). For any policies π, π' ∈ Δ(A)^S and ρ ∈ Δ(S),

V_ρ(π) − V_ρ(π') = 1/(1−γ) E_{(s,a)∼d^π}[A_{s,a}(π')] (41)
= 1/(1−γ) E_{s∼d^π}[⟨Q_s(π'), π_s − π'_s⟩], (42)

where Q_s(π) is shorthand for the vector [Q_{s,a}(π)]_{a∈A} ∈ R^{|A|} for any policy π. Proof. From Lemma 2 in Agarwal et al.
(2021), we have

V_ρ(π) − V_ρ(π') = 1/(1−γ) E_{(s,a)∼d^π}[A_{s,a}(π')] = 1/(1−γ) E_{s∼d^π}[⟨A_s(π'), π_s⟩],

where A_s(π) is shorthand for the vector [A_{s,a}(π)]_{a∈A} ∈ R^{|A|} for any policy π. To show (42), it suffices to show that

⟨A_s(π'), π_s⟩ = ⟨Q_s(π'), π_s − π'_s⟩, for all s ∈ S and π, π' ∈ Δ(A)^S.

Let 1_n denote the vector in R^n with all coordinates equal to 1. Indeed, using the definition of the advantage function A_{s,a}(π) = Q_{s,a}(π) − V_s(π), we have

⟨A_s(π'), π_s⟩ = ⟨Q_s(π') − V_s(π')·1_{|A|}, π_s⟩ = ⟨Q_s(π'), π_s⟩ − V_s(π')⟨1_{|A|}, π_s⟩ = ⟨Q_s(π'), π_s⟩ − V_s(π') (1)= ⟨Q_s(π'), π_s − π'_s⟩,

from which we conclude the proof.
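To make the closed-form update of Lemma 2 concrete, here is a small numerical sketch (NumPy, with toy dimensions of our choosing) of the multiplicative-weights form (37). It also illustrates why (37) and (38) coincide: subtracting from the features a vector c_s independent of the action shifts every logit by the same constant, which the normalization removes.

```python
import numpy as np

def pmd_softmax_update(pi_s, Phi_s, w, eta):
    """Closed-form mirror descent step at one state s (cf. (37)):
    pi^{(k+1)}_s  proportional to  pi^{(k)}_s * exp(-eta * Phi_s w),
    where Phi_s has rows phi_{s,a}.  Computed in log space for stability."""
    logits = np.log(pi_s) - eta * (Phi_s @ w)
    logits -= logits.max()          # numerical stabilisation only
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()    # normalise back onto the simplex
```

Replacing `Phi_s` by its action-centered version (each row minus the mean row) returns the exact same policy, which is the equality of (37) and (38).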

C ALGORITHMS

C.1 NPG AND Q-NPG ALGORITHM

Algorithm 1, combined with the sampling procedure (Algorithm 4) and the averaged-SGD procedure NPG-SGD (Algorithm 5), provides the sample-based NPG method. Similarly, Algorithm 2, combined with the sampling procedure (Algorithm 3) and the averaged-SGD procedure Q-NPG-SGD (Algorithm 6), provides the sample-based Q-NPG method.
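The composition just described has a simple outer-loop structure. The sketch below is our own minimal stand-in, not the paper's pseudocode: `regress` abstracts the sample-based solver (e.g. Q-NPG-SGD), and only the parameter update θ^{(k+1)} = θ^{(k)} − η_k w^{(k)} is shown.

```python
import numpy as np

def q_npg(theta0, regress, etas, num_iters):
    """Skeleton of the outer loop shared by Algorithms 1 and 2: at each
    iteration an (approximate) minimizer w_k of the regression loss is
    computed by a user-supplied solver, then theta is updated along -w_k.
    `regress` is an assumption of this sketch (it stands in for
    NPG-SGD / Q-NPG-SGD applied to the current policy)."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(num_iters):
        w_k = regress(theta)            # w_k ~ argmin_w L(w, theta_k, d_k)
        theta = theta - etas[k] * w_k   # inexact (Q-)NPG parameter update
    return theta
```

Plugging in a geometrically increasing list `etas` recovers the schedule used in the linear convergence theorems.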

C.2 SAMPLING PROCEDURES

In practice, we cannot compute the true minimizer w^{(k)} of the regression problem in either (32) or (18), since computing the expectation L_A or L_Q requires averaging over all state-action pairs (s, a) ∼ d^{(k)} and averaging over all trajectories (s_0, a_0, c_0, s_1, …) to compute the values of Q^{(k)}_{s,a} and A^{(k)}_{s,a}. So instead, we provide a sampler that obtains unbiased estimates of Q_{s,a}(θ) (or A_{s,a}(θ)) with (s, a) ∼ d^θ(ν) for any π(θ).

Algorithm 1: Natural policy gradient
Input: Initial state-action distribution ν, policy π^{(0)}, discount factor γ ∈ [0, 1), step size η_0 > 0 for the NPG update, step size α > 0 for the NPG-SGD update, number of iterations T for NPG-SGD
for k = 0 to K−1 do
  Compute w^{(k)} of (32) by NPG-SGD, i.e., Algorithm 5 with inputs (T, ν, π^{(k)}, γ, α)
  Update θ^{(k+1)} = θ^{(k)} − η_k w^{(k)} and the step size η_k
Output: π^{(K)}

Algorithm 2: Q-natural policy gradient
Input: Initial state-action distribution ν, policy π^{(0)}, discount factor γ ∈ [0, 1), step size η_0 > 0 for the Q-NPG update, step size α > 0 for the Q-NPG-SGD update, number of iterations T for Q-NPG-SGD
for k = 0 to K−1 do
  Compute w^{(k)} of (18) by Q-NPG-SGD, i.e., Algorithm 6 with inputs (T, ν, π^{(k)}, γ, α)
  Update θ^{(k+1)} = θ^{(k)} − η_k w^{(k)} and the step size η_k
Output: π_{θ^{(K)}}

Algorithm 3: Sampler for (s, a) ∼ d^θ(ν) and an unbiased estimate Q̂_{s,a}(θ) of Q_{s,a}(θ)
Input: Initial state-action distribution ν, policy π(θ), discount factor γ ∈ [0, 1)
Initialize (s_0, a_0) ∼ ν, the time steps h = t = 0, and the variable X = 1
while X = 1 do
  With probability γ: sample s_{h+1} ∼ P(· | s_h, a_h), sample a_{h+1} ∼ π_{s_{h+1}}(θ), and set h ← h + 1
  Otherwise, with probability 1 − γ: X = 0
Accept (s_h, a_h)
X = 1; set the estimate Q̂_{s_h,a_h}(θ) = c(s_h, a_h); t = h  ▹ start to estimate Q_{s_h,a_h}(θ)
while X = 1 do
  With probability γ: sample s_{t+1} ∼ P(· | s_t, a_t), sample a_{t+1} ∼ π_{s_{t+1}}(θ), set Q̂_{s_h,a_h}(θ) ← Q̂_{s_h,a_h}(θ) + c(s_{t+1}, a_{t+1}), and t ← t + 1
  Otherwise, with probability 1 − γ: X = 0
Accept Q̂_{s_h,a_h}(θ)
Output: (s_h, a_h) and Q̂_{s_h,a_h}(θ)

To solve (18), we sample (s, a) ∼ d^{(k)} and Q^{(k)}_{s,a} by a standard rollout, formalized in Algorithm 3. This sampling procedure is commonly used, for example in Agarwal et al. (2021, Algorithm 1). It is straightforward to verify that (s_h, a_h) and Q̂_{s_h,a_h}(θ) obtained in Algorithm 3 are unbiased for any π(θ), and that the expected length of the trajectory is 1/(1−γ). We provide the proof here for completeness.

Lemma 4. Consider the output (s_h, a_h) and Q̂_{s_h,a_h}(θ) of Algorithm 3. It follows that

E[h + 1] = 1/(1−γ), Pr(s_h = s, a_h = a) = d^θ_{s,a}(ν), E[Q̂_{s_h,a_h}(θ) | s_h, a_h] = Q_{s_h,a_h}(θ).

Proof. The expected length (h + 1) of sampling (s, a) is

E[h + 1] = Σ_{k=0}^∞ Pr(h = k)(k + 1) = (1−γ) Σ_{k=0}^∞ γ^k (k + 1) = 1/(1−γ).

The probability of the state-action pair (s, a) being sampled by Algorithm 3 is

Pr(s_h = s, a_h = a) = Σ_{(s_0,a_0)∈S×A} ν_{s_0,a_0} Σ_{k=0}^∞ Pr(h = k) Pr_{π(θ)}(s_h = s, a_h = a | h = k, s_0, a_0)
= Σ_{(s_0,a_0)∈S×A} ν_{s_0,a_0} (1−γ) Σ_{k=0}^∞ γ^k Pr_{π(θ)}(s_k = s, a_k = a | s_0, a_0) (4)= d^θ_{s,a}(ν).

Now we verify that Q̂_{s_h,a_h}(θ) obtained from Algorithm 3 is an unbiased estimate of Q_{s_h,a_h}(θ). Indeed, from Algorithm 3, we have Q̂_{s_h,a_h}(θ) = Σ_{t=0}^H c(s_{t+h}, a_{t+h}), where H + 1 is the length of the horizon executed in the second loop of Algorithm 3 for calculating Q̂_{s_h,a_h}(θ). To simplify notation, we consider the estimate of Q_{s,a} for any (s, a) ∈ S × A following the same procedure, starting from the second loop of Algorithm 3. Taking expectations, we have

E[Q̂_{s,a}(θ) | s, a] = E[Σ_{t=0}^H c(s_t, a_t) | s_0 = s, a_0 = a]
= Σ_{k=0}^∞ Pr(H = k) E[Σ_{t=0}^H c(s_t, a_t) | s_0 = s, a_0 = a, H = k]
= Σ_{k=0}^∞ (1−γ)γ^k E[Σ_{t=0}^k c(s_t, a_t) | s_0 = s, a_0 = a]
= (1−γ) E[Σ_{t=0}^∞ c(s_t, a_t) Σ_{k=t}^∞ γ^k | s_0 = s, a_0 = a]
= E[Σ_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s, a_0 = a] (2)= Q_{s,a}(θ).

The desired result is obtained by setting s = s_h and a = a_h.
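For concreteness, here is a toy implementation of Algorithm 3 on a small tabular MDP. The data structures `P` and `pi` (nested lists of probabilities) and the `cost` callable are our own assumptions for illustration; the paper states the algorithm abstractly. The first loop draws (s_h, a_h) with the geometric stopping rule; the second accumulates the undiscounted tail costs, whose conditional expectation is Q_{s_h,a_h}(θ) by Lemma 4.

```python
import random

def sample_q(P, pi, cost, gamma, s0, a0, rng):
    """Sketch of Algorithm 3 on a finite MDP.
    P[s][a]: list of next-state probabilities; pi[s]: action probabilities;
    cost(s, a): per-step cost.  Returns (s_h, a_h) distributed as the
    discounted visitation measure, plus an unbiased estimate of Q_{s_h,a_h}."""
    s, a = s0, a0
    while rng.random() < gamma:            # continue w.p. gamma, stop w.p. 1-gamma
        s = rng.choices(range(len(P)), weights=P[s][a])[0]
        a = rng.choices(range(len(pi[s])), weights=pi[s])[0]
    q_hat = cost(s, a)                     # accepted pair (s_h, a_h); start Q estimate
    t_s, t_a = s, a
    while rng.random() < gamma:            # undiscounted tail sum of costs
        t_s = rng.choices(range(len(P)), weights=P[t_s][t_a])[0]
        t_a = rng.choices(range(len(pi[t_s])), weights=pi[t_s])[0]
        q_hat += cost(t_s, t_a)
    return s, a, q_hat
```

On a one-state, one-action MDP with unit cost and γ = 0.5, the average of `q_hat` over many calls approaches Q = 1/(1−γ) = 2, matching Lemma 4.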
Similarly to Algorithm 3, to solve (32), we sample (s, a) ∼ d^{(k)} by the same procedure and estimate A^{(k)}_{s,a} with a slight modification, namely Algorithm 4 (see also Agarwal et al., 2021, Algorithm 3).

Algorithm 4: Sampler for (s, a) ∼ d^θ(ν) and an unbiased estimate Â_{s,a}(θ) of A_{s,a}(θ)
Input: Initial state-action distribution ν, policy π(θ), discount factor γ ∈ [0, 1)
Initialize (s_0, a_0) ∼ ν, the time steps h = t = 0, and the variable X = 1
while X = 1 do
  With probability γ: sample s_{h+1} ∼ P(· | s_h, a_h), sample a_{h+1} ∼ π_{s_{h+1}}(θ), and set h ← h + 1
  Otherwise, with probability 1 − γ: X = 0
Accept (s_h, a_h)
X = 1; set the estimate Q̂_{s_h,a_h}(θ) = c(s_h, a_h); t = h  ▹ start to estimate Q_{s_h,a_h}(θ)
while X = 1 do
  With probability γ: sample s_{t+1} ∼ P(· | s_t, a_t), sample a_{t+1} ∼ π_{s_{t+1}}(θ), set Q̂_{s_h,a_h}(θ) ← Q̂_{s_h,a_h}(θ) + c(s_{t+1}, a_{t+1}), and t ← t + 1
  Otherwise, with probability 1 − γ: X = 0
Accept Q̂_{s_h,a_h}(θ)
X = 1; set the estimate V̂_{s_h}(θ) = 0; t = h  ▹ start to estimate V_{s_h}(θ)
while X = 1 do
  Sample a_t ∼ π_{s_t}(θ); set V̂_{s_h}(θ) ← V̂_{s_h}(θ) + c(s_t, a_t)
  With probability γ: sample s_{t+1} ∼ P(· | s_t, a_t) and set t ← t + 1
  Otherwise, with probability 1 − γ: X = 0
Accept V̂_{s_h}(θ)
Output: (s_h, a_h) and Â_{s_h,a_h}(θ) = Q̂_{s_h,a_h}(θ) − V̂_{s_h}(θ)

Notice that the sampling procedure for estimating Q_{s,a}(θ) in Algorithm 3 is simpler than that for estimating A_{s,a}(θ) in Algorithm 4, since Algorithm 4 requires an additional estimation of V_s(θ) and thus doubles the number of samples needed to estimate A_{s,a}(θ). As in Lemma 4, we verify in the following lemma that the output (s_h, a_h) is sampled from the distribution d^θ and that Â_{s_h,a_h}(θ) in Algorithm 4 is an unbiased estimator of A_{s_h,a_h}(θ) for any policy π(θ).

Lemma 5. Consider the output (s_h, a_h) and Â_{s_h,a_h}(θ) of Algorithm 4. It follows that

E[h + 1] = 1/(1−γ), Pr(s_h = s, a_h = a) = d^θ_{s,a}(ν), E[Â_{s_h,a_h}(θ) | s_h, a_h] = A_{s_h,a_h}(θ).

Proof.
Since the procedure for sampling (s_h, a_h) in Algorithm 4 is identical to the one in Algorithm 3, the first two results follow from Lemma 4. It remains to show that Â_{s_h,a_h}(θ) is unbiased. The estimation of A_{s_h,a_h}(θ) decomposes into the estimations of Q_{s_h,a_h}(θ) and V_{s_h}(θ). The procedure for estimating Q_{s_h,a_h}(θ) is also identical to the one in Algorithm 3; thus, from Lemma 4, we have E[Q̂_{s_h,a_h}(θ) | s_h, a_h] = Q_{s_h,a_h}(θ). By following similar arguments to those of Lemma 4, one can verify that E[V̂_{s_h}(θ) | s_h, a_h] = V_{s_h}(θ). Combining the above two equalities, we obtain

E[Â_{s_h,a_h}(θ) | s_h, a_h] = E[Q̂_{s_h,a_h}(θ) − V̂_{s_h}(θ) | s_h, a_h] = Q_{s_h,a_h}(θ) − V_{s_h}(θ) = A_{s_h,a_h}(θ).

C.3 SGD PROCEDURES FOR SOLVING THE REGRESSION PROBLEMS OF NPG AND Q-NPG

Once we obtain the sampled (s, a) and Â_{s,a}(θ^{(k)}) from Algorithm 4, we can apply the averaged SGD algorithm of Bach & Moulines (2013) to solve the regression problem (32) of NPG at every iteration k. Here we suppress the superscript (k). For any parameter θ ∈ R^m, recall the compatible function approximation L_A in (32):

L_A(w, θ, d^θ) = E_{(s,a)∼d^θ}[(w^⊤ φ̄_{s,a}(θ) − A_{s,a}(θ))²].

With the output (s, a) ∼ d^θ and Â_{s,a}(θ) from Algorithm 4 (here we suppress the subscript h), we compute the stochastic gradient estimator of the function L_A in (32) as

∇̂_w L_A(w, θ, d^θ) := 2(w^⊤ φ̄_{s,a}(θ) − Â_{s,a}(θ)) φ̄_{s,a}(θ). (44)

Next, we show that (44) is an unbiased gradient estimator of the loss function L_A.

Lemma 6. Consider the output (s, a) and Â_{s,a}(θ) of Algorithm 4 and the stochastic gradient (44). It follows that

E[∇̂_w L_A(w, θ, d^θ)] = ∇_w L_A(w, θ, d^θ),

where the expectation is with respect to the randomness in the sequence of samples s_0, a_0, …, s_t, a_t from Algorithm 4.

Proof. The total expectation of the stochastic gradient is given by

E[∇̂_w L_A(w, θ, d^θ)] = E_{s, a, Â_{s,a}(θ)}[2(w^⊤ φ̄_{s,a}(θ) − Â_{s,a}(θ)) φ̄_{s,a}(θ)]
= E_{(s,a)∼d^θ}[E_{Â_{s,a}(θ)}[2(w^⊤ φ̄_{s,a}(θ) − Â_{s,a}(θ)) φ̄_{s,a}(θ) | s, a]], (45)

where the second line is obtained since (s, a) ∼ d^θ by Lemma 5. From Lemma 5, we also have

E_{s_0,a_0,…,s_t,a_t}[Â_{s,a}(θ) | s_0 = s, a_0 = a] = A_{s,a}(θ).

Combining the above two equalities yields

E[∇̂_w L_A(w, θ, d^θ)] (45)= E_{(s,a)∼d^θ}[2(w^⊤ φ̄_{s,a}(θ) − E[Â_{s,a}(θ) | s, a]) φ̄_{s,a}(θ)] (46)= E_{(s,a)∼d^θ}[2(w^⊤ φ̄_{s,a}(θ) − A_{s,a}(θ)) φ̄_{s,a}(θ)] = ∇_w L_A(w, θ, d^θ),

as desired. Since (44) is unbiased, as shown in Lemma 6, we can use it in the averaged SGD algorithm to minimize L_A; we call this NPG-SGD (Algorithm 5; see also Agarwal et al., 2021, Algorithm 4). Similarly to Algorithm 5, once we obtain the sampled (s, a) and Q̂_{s,a}(θ) from Algorithm 3, we can apply the averaged SGD algorithm to solve (18) of Q-NPG.
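A minimal sketch of the averaged-SGD solver used by NPG-SGD and Q-NPG-SGD. The `sample` callable is our own abstraction (it stands in for a call to Algorithm 3 or 4, returning a feature vector and an unbiased target); the gradient step and the iterate averaging follow (44) and (47).

```python
import numpy as np

def averaged_sgd(sample, alpha, T, dim, rng):
    """Averaged SGD on the regression loss E[(w^T phi - target)^2]:
    stochastic gradient 2 (w^T phi - target) phi, constant step size alpha,
    and the iterate average (1/T) * sum_t w_t returned as the solution."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(T):
        phi, target = sample(rng)                       # unbiased (phi, target) pair
        w = w - alpha * 2.0 * (w @ phi - target) * phi  # SGD step, cf. (44)/(47)
        w_sum += w
    return w_sum / T                                    # Polyak-style average
```

On a noiseless linear target the average recovers the true regression vector, which is the behaviour the sample complexity analysis relies on.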

Recall the compatible function approximation L_Q in (18):

L_Q(w, θ, d^θ) = E_{(s,a)∼d^θ}[(w^⊤ φ_{s,a} − Q_{s,a}(θ))²].

With the output (s, a) ∼ d^θ and Q̂_{s,a}(θ) from Algorithm 3, we compute the stochastic gradient estimator of the function L_Q in (18) as

∇̂_w L_Q(w, θ, d^θ) := 2(w^⊤ φ_{s,a} − Q̂_{s,a}(θ)) φ_{s,a}, (47)

and use it in the averaged SGD algorithm to minimize L_Q; we call this Q-NPG-SGD (Algorithm 6; see also Agarwal et al., 2021, Algorithm 2). Compared to (44), the cost of computing (47) is |A| times cheaper: to compute (47), we only need one single action for φ_{s,a}, while to compute (44), one needs to go through all the actions to compute φ̄_{s,a}(θ). Thus, the computational cost of Q-NPG-SGD is |A| times cheaper than that of NPG-SGD. The estimator ∇̂_w L_Q(w, θ, d^θ) is also unbiased, following an argument similar to the proof of Lemma 6. We formalize this in the following lemma and omit the proof.

Lemma 7. Consider the output (s, a) and Q̂_{s,a}(θ) of Algorithm 3 and the stochastic gradient (47). It follows that

E[∇̂_w L_Q(w, θ, d^θ)] = ∇_w L_Q(w, θ, d^θ),

where the expectation is with respect to the randomness in the sequence of samples s_0, a_0, …, s_t, a_t from Algorithm 3.

Algorithm 6: Q-NPG-SGD
Input: Number of iterations T, step size α > 0, initialization w_0 ∈ R^m, initial state-action measure ν, policy π(θ), discount factor γ ∈ [0, 1)
for t = 0 to T−1 do
  Call Algorithm 3 with the inputs (ν, π(θ), γ) to sample (s, a) ∼ d^θ and Q̂_{s,a}(θ)
  Update w_{t+1} = w_t − α ∇̂_w L_Q(w_t, θ, d^θ) using (47)
Output: w_out = (1/T) Σ_{t=1}^T w_t

D PROOF OF SECTION 4

Throughout this section and the next, we use the shorthand V^{(k)}_ρ for V_ρ(θ^{(k)}) and, similarly, Q^{(k)}_{s,a} for Q_{s,a}(θ^{(k)}) and A^{(k)}_{s,a} for A_{s,a}(θ^{(k)}). We also use the shorthand Q^{(k)}_s for the vector [Q^{(k)}_{s,a}]_{a∈A} ∈ R^{|A|} and A^{(k)}_s for the vector [A^{(k)}_{s,a}]_{a∈A} ∈ R^{|A|}. We also introduce a weighted KL divergence given by

D*_k := E_{s∼d*}[D(π*_s, π^{(k)}_s)].
If we choose the uniform initial policy, i.e., π_{s,a} = 1/|A| for all (s, a) ∈ S × A (or θ^{(0)} = 0), then D*_0 ≤ log|A| for all ρ ∈ Δ(S) and any π* ∈ Δ(A)^S. We first provide the one-step analysis of the Q-NPG update, which will be helpful for proving Theorems 1 and 3 and the sublinear convergence result of Theorem 2 further in Appendix D.3.
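The initialization bound D*_0 ≤ log|A| holds because D(p, Unif(A)) = log|A| − H(p) ≤ log|A| for every p ∈ Δ(A). A quick numerical sanity check (assuming nothing beyond the definition of the KL divergence):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p, q) between distributions on a finite action set,
    with the convention 0 * log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

The bound is tight: a deterministic policy p attains D(p, Unif(A)) = log|A| exactly.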

D.1 THE ONE STEP Q-NPG LEMMA

The following one-step analysis of Q-NPG is based on the mirror descent approach of Xiao (2022).

Lemma 8 (One-step Q-NPG lemma). Fix a state distribution ρ, an initial state-action distribution ν, and an arbitrary comparator policy π*. Let w_⋆^{(k)} ∈ argmin_w L_Q(w, θ^{(k)}, d^{(k)}) denote the exact minimizer, and consider the w^{(k)} and π^{(k)} given in (18) and (16), respectively. We have that

ϑ_ρ(1−γ)(V^{(k+1)}_ρ − V^{(k)}_ρ) + (1−γ)(V^{(k)}_ρ − V_ρ(π*))
+ ϑ_ρ [ Σ_{s∈S} Σ_{a∈A} d^{(k+1)}_s π^{(k+1)}_{s,a} φ_{s,a}^⊤(w^{(k)} − w_⋆^{(k)})  (1)
+ Σ_{s∈S} Σ_{a∈A} d^{(k+1)}_s π^{(k+1)}_{s,a} (φ_{s,a}^⊤ w_⋆^{(k)} − Q^{(k)}_{s,a})  (2)
+ Σ_{s∈S} Σ_{a∈A} d^{(k+1)}_s π^{(k)}_{s,a} φ_{s,a}^⊤(w_⋆^{(k)} − w^{(k)})  (3)
+ Σ_{s∈S} Σ_{a∈A} d^{(k+1)}_s π^{(k)}_{s,a} (Q^{(k)}_{s,a} − φ_{s,a}^⊤ w_⋆^{(k)}) ]  (4)
+ Σ_{(s,a)∈S×A} d*_s π^{(k)}_{s,a} φ_{s,a}^⊤(w^{(k)} − w_⋆^{(k)})  (a)
+ Σ_{(s,a)∈S×A} d*_s π^{(k)}_{s,a} (φ_{s,a}^⊤ w_⋆^{(k)} − Q^{(k)}_{s,a})  (b)
+ Σ_{(s,a)∈S×A} d*_s π*_{s,a} φ_{s,a}^⊤(w_⋆^{(k)} − w^{(k)})  (c)
+ Σ_{(s,a)∈S×A} d*_s π*_{s,a} (Q^{(k)}_{s,a} − φ_{s,a}^⊤ w_⋆^{(k)})  (d)
≤ (1/η_k) D*_k − (1/η_k) D*_{k+1}. (49)

Proof. In the context of the PMD method (16), we apply the three-point descent lemma (Lemma 11) with C = Δ(A), f the linear function η_k⟨Φ_s w^{(k)}, ·⟩, and h: Δ(A) → R the negative entropy h(p) = Σ_{a∈A} p_a log p_a. Thus, h is of Legendre type with rint dom h ∩ C = rint Δ(A) ≠ ∅, and D_h(·, ·) is the KL divergence D(·, ·). From Lemma 11, we obtain that for any p ∈ Δ(A),

η_k⟨Φ_s w^{(k)}, π^{(k+1)}_s⟩ + D(π^{(k+1)}_s, π^{(k)}_s) ≤ η_k⟨Φ_s w^{(k)}, p⟩ + D(p, π^{(k)}_s) − D(p, π^{(k+1)}_s).

Rearranging terms and dividing both sides by η_k, we get

⟨Φ_s w^{(k)}, π^{(k+1)}_s − p⟩ + (1/η_k) D(π^{(k+1)}_s, π^{(k)}_s) ≤ (1/η_k) D(p, π^{(k)}_s) − (1/η_k) D(p, π^{(k+1)}_s). (50)

Letting p = π^{(k)}_s yields

⟨Φ_s w^{(k)}, π^{(k+1)}_s − π^{(k)}_s⟩ ≤ −(1/η_k) D(π^{(k+1)}_s, π^{(k)}_s) − (1/η_k) D(π^{(k)}_s, π^{(k+1)}_s) ≤ 0. (51)

Letting p = π*_s in (50) and subtracting and adding π^{(k)}_s within the inner product yields

⟨Φ_s w^{(k)}, π^{(k+1)}_s − π^{(k)}_s⟩ + ⟨Φ_s w^{(k)}, π^{(k)}_s − π*_s⟩ ≤ (1/η_k) D(π*_s, π^{(k)}_s) − (1/η_k) D(π*_s, π^{(k+1)}_s).
Note that we dropped the nonnegative term 1 η k D(π (k+1) s , π s ) on the left hand side to the inequality. Taking expectation with respect to the distribution d * , we have E s∼d * Φ s w (k) , π (k+1) s -π (k) s + E s∼d * Φ s w (k) , π (k) s -π * s ≤ 1 η k D * k - 1 η k D * k+1 . ( ) For the first expectation in (52), we have E s∼d * Φ s w (k) , π (k+1) s -π (k) s = s∈S d * s Φ s w (k) , π (k+1) s -π (k) s = s∈S d * s d (k+1) s d (k+1) s Φ s w (k) , π (k+1) s -π (k) s ≥ ϑ k+1 s∈S d (k+1) s Φ s w (k) , π (k+1) s -π (k) s ≥ ϑ ρ s∈S d (k+1) s Φ s w (k) , π (k+1) s -π (k) s = ϑ ρ s∈S d (k+1) s Q (k) s , π (k+1) s -π (k) s + ϑ ρ s∈S d (k+1) s Φ s w (k) -Q (k) s , π (k+1) s -π (k) s = ϑ ρ (1 -γ) V (k+1) ρ -V (k) ρ + ϑ ρ s∈S d (k+1) s Φ s w (k) -Q (k) s , π (k+1) s -π (k) s , ( ) where the last equality is due to the performance difference lemma (42) in Lemma 3 and the two inequalities above are obtained by the negative sign of Φ s w (k) , π (k+1) s -π (k) s shown in (51) and by using the following inequality d * s d (k+1) s (20) ≤ ϑ k+1 (20) ≤ ϑ ρ . The second term of ( 53) can be decomposed into four terms. That is, s∈S d (k+1) s Φ s w (k) -Q (k) s , π (k+1) s -π (k) s = s∈S a∈A d (k+1) s π (k+1) s,a φ s,a w (k) -Q (k) s,a + s∈S a∈A d (k+1) s π (k) s,a Q (k) s,a -φ s,a w (k) = s∈S a∈A d (k+1) s π (k+1) s,a φ s,a w (k) -w (k) + s∈S a∈A d (k+1) s π (k+1) s,a φ s,a w (k) -Q (k) s,a + s∈S a∈A d (k+1) s π (k) s,a φ s,a w (k) -w (k) + s∈S a∈A d (k+1) s π (k) s,a Q (k) s,a -φ s,a w (k) = 1 + 2 + 3 + 4 , ( ) where 1 , 2 , 3 and 4 are defined in (49). For the second expectation in ( 52), by applying again the performance difference lemma (42), we have E s∼d * Φ s w (k) , π (k) s -π * s = E s∼d * Q (k) s , π (k) s -π * s + E s∼d * Φ s w (k) -Q (k) s , π (k) s -π * s (42) = (1 -γ) V (k) ρ -V ρ (π * ) + E s∼d * Φ s w (k) -Q (k) s , π (k) s -π * s . Similarly, we decompose the second term of ( 55) into four terms. 
That is, E s∼d * Φ s w (k) -Q (k) s , π (k) s -π * s = s∈S a∈A d * s π (k) s,a φ s,a w (k) -Q (k) s,a + s∈S a∈A d * s π * s,a Q (k) s,a -φ s,a w (k) = (s,a)∈S×A d * s π (k) s,a φ s,a w (k) -w (k) + (s,a)∈S×A d * s π (k) s,a φ s,a w (k) -Q (k) s,a + (s,a)∈S×A d * s π * s,a φ s,a w (k) -w (k) + (s,a)∈S×A d * s π * s,a Q (k) s,a -φ s,a w (k) = a + b + c + d , where a , b , c and d are defined in (49). Plugging (53) with the decomposition ( 54) and ( 55) with the decomposition (56) into (52) concludes the proof. Consequently, the convergence analysis of Q-NPG (Theorem 1, 3 and Theorem 2 further in Appendix D.3) will be obtained by upper bounding the absolute values of 1 , 2 , 3 , 4 , a , b , c , d in ( 49) with different set of assumptions (assumptions in Theorem 1 or assumptions in Theorem 3) and with different step size scheme (geometrically increasing step size for Theorem 1 and 3 or constant step size for Theorem 2).
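The PMD step (16) underlying Lemma 8 has a closed form: minimizing η_k⟨Φ_s w^{(k)}, p⟩ + D(p, π^{(k)}_s) over the simplex is a multiplicative-weights update, π^{(k+1)}_{s,a} ∝ π^{(k)}_{s,a} exp(-η_k φ_{s,a}^T w^{(k)}) (the minus sign reflects the cost-minimization convention). Below is a minimal numpy sketch with hypothetical feature values; the descent check at the end is exactly the p = π^{(k)}_s case of the three-point lemma:

```python
import numpy as np

def pmd_update(pi_s, Phi_s, w, eta):
    """One PMD / Q-NPG step at state s:
    pi_new in argmin_p { eta * <Phi_s w, p> + KL(p, pi_s) },
    with closed form pi_new_a ∝ pi_s_a * exp(-eta * phi_{s,a}^T w)."""
    logits = np.log(pi_s) - eta * Phi_s @ w
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
A, m = 4, 3
pi_s = np.full(A, 1.0 / A)                  # current policy at s (uniform)
Phi_s = rng.standard_normal((A, m))         # feature matrix Phi_s (A x m), illustrative
w = rng.standard_normal(m)                  # regression solution w^{(k)}, illustrative
eta = 0.5

pi_new = pmd_update(pi_s, Phi_s, w, eta)
assert np.isclose(pi_new.sum(), 1.0)

# Descent property: the update is the exact minimizer of the
# linearized objective eta * <Phi_s w, p> + KL(p, pi_s), so it
# cannot do worse than staying at pi_s.
obj = lambda p: eta * p @ (Phi_s @ w) + np.sum(p * np.log(p / pi_s))
assert obj(pi_new) <= obj(pi_s) + 1e-12
```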

D.2 PROOF OF THEOREM 1

Here we provide the general Q-NPG convergence analysis with any initial policy, which is in contrast to Theorem 1 with the uniform initial policy. Theorem 5. Fix a state distribution ρ, an state-action distribution ν and a comparator policy π * . We consider the Q-NPG method (18) with the step sizes satisfying η 0 ≥ 1-γ γ D * 0 and η k+1 ≥ 1 γ η k . Suppose that Assumptions 1, 2, 3 and 4 all hold. Then we have for all k ≥ 0, E V ρ (π (k) ) -V ρ (π * ) ≤ 1 - 1 ϑ ρ k 2 1 -γ + 2 |A| ϑ ρ C ρ + 1 1 -γ κ ν 1 -γ stat + √ bias . Remark. A deviation from the setting of Theorem 1 is that here we require the step sizes satisfying η 0 ≥ 1-γ γ D * 0 instead of η 0 ≥ 1-γ γ log |A|. This is because when using uniform initial policy in Theorem 1, from (48), we know that η 0 ≥ 1-γ γ log |A| is a sufficient condition for η 0 ≥ 1-γ γ D * 0 . Consequently, to prove Theorem 1, it suffices to prove the more general Theorem 5. Proof. From ( 49  | 1 | ≤ s∈S a∈A d (k+1) s π (k+1) s,a φ s,a w (k) -w (k) ≤ (s,a)∈S×A d (k+1) s 2 π (k+1) s,a 2 d * s • Unif A (a) • (s,a)∈S×A d * s • Unif A (a) φ s,a w (k) -w (k) 2 (23) = (s,a)∈S×A d (k+1) s 2 π (k+1) s,a 2 d * s • Unif A (a) w (k) -w (k) 2 Σ d * ≤ E s∼d *   d (k+1) s d * s 2   |A| w (k) -w (k) 2 Σ d * (25) ≤ C ρ |A| w (k) -w (k) 2 Σ d * , where the second inequality is obtained by Cauchy-Schwartz's inequality, and the third inequality is obtained by the following inequality a∈A π (k+1) s,a 2 ≤ a∈A π (k+1) s,a = 1. ( ) Then, by using Assumption 3 with the definition of κ ν , ( 57) is upper bounded by | 1 | (24) ≤ C ρ |A|κ ν w (k) -w (k) 2 Σν (5) ≤ C ρ |A|κ ν 1 -γ w (k) -w (k) 2 Σ d (k) , ( ) where we use the shorthand Σ d (k) def = E (s,a)∼ d (k) φ s,a φ s,a . Besides, by the first-order optimality conditions for the optima w (k) ∈ argmin w L Q (w, θ (k) , d (k) ), we have (w -w (k) ) ∇ w L Q (w (k) , θ (k) , d (k) ) ≥ 0, for all w ∈ R m . 
(61) Therefore, for all w ∈ R m , L Q (w, θ (k) , d (k) ) -L Q (w (k) , θ (k) , d (k) ) = E (s,a)∼ d (k) φ s,a w -φ s,a w (k) + φ s,a w (k) -Q (k) s,a 2 -L Q (w (k) , θ (k) , d (k) ) = E (s,a)∼ d (k) (φ s,a w -φ s,a w (k) ) 2 + 2(w -w (k) ) E (s,a)∼ d (k) (φ s,a w (k) -Q (k) s,a )φ s,a = w -w (k) 2 Σ d (k) + (w -w (k) ) ∇ w L Q (w (k) , θ (k) , d (k) ) (61) ≥ w -w (k) 2 Σ d (k) . ( ) Define (k) stat def = L Q (w (k) , θ (k) , d (k) ) -L Q (w (k) , θ (k) , d (k) ). Note that from (19), we have E (k) stat ≤ stat . Plugging ( 62) into (59), we have | 1 | ≤ C ρ |A|κ ν 1 -γ (k) stat . ( ) Similar to (57), we get the same upper bound for | 3 | by just replacing π (k+1) s,a into π (k) s,a . That is, | 3 | ≤ C ρ |A|κ ν 1 -γ (k) stat . ( ) To upper bound | 2 | and | 4 |, we introduce the following term (k) bias def = L Q (w (k) , θ (k) , d * ). Note that from ( 22), we have E (k) bias ≤ bias . ( ) By Cauchy-Schwartz's inequality, we have | 2 | ≤ s∈S a∈A d (k+1) s π (k+1) s,a φ s,a w (k) -Q (k) s,a ≤ (s,a)∈S×A d (k+1) s 2 π (k+1) s,a 2 d * s • Unif A (a) • (s,a)∈S×A d * s • Unif A (a) φ s,a w (k) -Q (k) s,a 2 = (s,a)∈S×A d (k+1) s 2 π (k+1) s,a 2 d * s • Unif A (a) • (k) bias (58) ≤ E s∼d *   d (k+1) s d * s 2   |A| (k) bias (25) ≤ C ρ |A| (k) bias . ( ) Similar to (67), we get the same upper bound for | 4 | by just replacing π (k+1) s,a into π (k) s,a . That is, | 4 | ≤ C ρ |A| (k) bias . Next, we will upper bound the absolute values of a , b , c and d of (49) separately by using again the statistical error (19) and by using the transfer error assumption ( 22). Indeed, to upper bound | a |, by Cauchy-Schwartz's inequality, we have | a | ≤ (s,a)∈S×A d * s π (k) s,a φ s,a w (k) -w (k) ≤ (s,a)∈S×A (d * s ) 2 π (k) s,a 2 d * s • Unif A (a) (s,a)∈S×A d * s • Unif A (a) φ s,a w (k) -w (k) 2 (23) = (s,a)∈S×A (d * s ) 2 π (k) s,a 2 d * s • Unif A (a) w (k) -w (k) 2 Σ d * (58) ≤ |A| w (k) -w (k) 2 Σ d * . 
From the definition of κ ν , we further obtain | a | (24) ≤ |A|κ ν w (k) -w (k) 2 Σν (5) ≤ |A|κ ν 1 -γ w (k) -w (k) 2 Σ d (k) (62) ≤ |A|κ ν 1 -γ (k) stat . ( ) Similar to (69), we get the same upper bound for | c | by just replacing π (k) s,a into π * s,a . That is, | c | ≤ |A|κ ν 1 -γ (k) stat . ( ) To upper bound | b |, by Cauchy-Schwartz's inequality, we have | b | ≤ (s,a)∈S×A d * s π (k) s,a φ s,a w (k) -Q (k) s,a ≤ (s,a)∈S×A (d * s ) 2 π (k) s,a 2 d * s • Unif A (a) (s,a)∈S×A d * s • Unif A (a) φ s,a w (k) -Q (k) s,a 2 = (s,a)∈S×A (d * s ) 2 π (k) s,a 2 d * s • Unif A (a) (k) bias (58) ≤ |A| (k) bias . ( ) Similar to (71), we get the same upper bound for | d | by just replacing π (k) s,a into π * s,a . That is, | d | ≤ |A| (k) bias . Plugging all the upper bounds (64 ) of | 1 |, (67) of | 2 |, (65) of | 3 |, (68) of | 4 |, (69) of | a |, (71) of | b |, (70) of | c | and (72) of | d | into (49) yields ϑ ρ (δ k+1 -δ k ) + δ k ≤ D * k (1 -γ)η k - D * k+1 (1 -γ)η k + 2 |A| ϑ ρ C ρ + 1 1 -γ κ ν 1 -γ (k) stat + (k) bias , where δ k def = V (k) ρ -V ρ (π * ). Dividing both sides by ϑ ρ and rearranging terms, we get δ k+1 + D * k+1 (1 -γ)η k ϑ ρ ≤ 1 - 1 ϑ ρ δ k + D * k (1 -γ)η k (ϑ ρ -1) + 2 |A| C ρ + 1 ϑρ 1 -γ κ ν 1 -γ (k) stat + (k) bias . If the step sizes satisfy η k+1 (ϑ ρ -1) ≥ η k ϑ ρ , which is implied by η k+1 ≥ η k /γ and (20), then δ k+1 + D * k+1 (1 -γ)η k+1 (ϑ ρ -1) ≤ 1 - 1 ϑ ρ δ k + D * k (1 -γ)η k (ϑ ρ -1) + 2 |A| C ρ + 1 ϑρ 1 -γ κ ν 1 -γ (k) stat + (k) bias ≤ 1 - 1 ϑ ρ k+1 δ 0 + D * 0 (1 -γ)η 0 (ϑ ρ -1) + k t=0 1 - 1 ϑ ρ k-t 2 |A| C ρ + 1 ϑρ 1 -γ κ ν 1 -γ (t) stat + (t) bias . Finally, by choosing η 0 ≥ 1-γ γ D * 0 and using the fact that (1 -γ)(ϑ ρ -1) (20) ≥ (1 -γ) 1 1 -γ -1 = γ, we obtain δ k ≤ δ k + D * k (1 -γ)η k ϑ ρ ≤ 1 - 1 ϑ ρ k 2 1 -γ + 2 |A| C ρ + 1 ϑρ 1 -γ k-1 t=0 1 - 1 ϑ ρ k-1-t κ ν 1 -γ (t) stat + (t) bias . 
Taking the total expectation with respect to the randomness in the sequence of the iterates w (0) , • • • , w (k-1) , we have E V ρ (π (k) ) -V ρ (π * ) ≤ 1 - 1 ϑ ρ k 2 1 -γ + 2 |A| C ρ + 1 ϑρ 1 -γ k-1 t=0 1 - 1 ϑ ρ k-1-t E κ ν 1 -γ (t) stat + E (t) bias ≤ 1 - 1 ϑ ρ k 2 1 -γ + 2 |A| C ρ + 1 ϑρ 1 -γ k-1 t=0 1 - 1 ϑ ρ k-1-t κ ν 1 -γ E (t) stat + E (t) bias (63)+(66) ≤ 1 - 1 ϑ ρ k 2 1 -γ + 2 |A| C ρ + 1 ϑρ 1 -γ k-1 t=0 1 - 1 ϑ ρ k-1-t κ ν 1 -γ stat + √ bias ≤ 1 - 1 ϑ ρ k 2 1 -γ + 2 |A| ϑ ρ C ρ + 1 1 -γ κ ν 1 -γ stat + √ bias , where the second inequality is obtained by Jensen's inequality. This concludes the proof.
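Stripped of the error decomposition, the proof of Theorem 5 rests on the recursion x_{k+1} ≤ (1 - 1/ϑ_ρ) x_k + ε, which the geometrically increasing step sizes η_{k+1} = η_k/γ make available at every iteration. The sketch below (illustrative constants, not the paper's setting) unrolls that recursion and checks the implied geometric decay down to an O(ϑ_ρ ε) error floor:

```python
import numpy as np

gamma = 0.9
theta_rho = 1.0 / (1.0 - gamma)   # lower bound of the mismatch coefficient
eps = 1e-3                        # per-iteration stat/bias error (illustrative)
rate = 1.0 - 1.0 / theta_rho      # contraction factor; here equal to gamma

x = 2.0 / (1.0 - gamma)           # initial optimality-gap bound
xs = [x]
for _ in range(200):
    x = rate * x + eps            # x_{k+1} <= (1 - 1/theta_rho) x_k + eps
    xs.append(x)

# Unrolled bound: x_k <= rate^k x_0 + eps * sum_{t<k} rate^t <= rate^k x_0 + theta_rho * eps.
for k, xk in enumerate(xs):
    assert xk <= rate**k * xs[0] + theta_rho * eps + 1e-9

# The step-size schedule eta_{k+1} = eta_k / gamma grows geometrically,
# starting from eta_0 >= (1-gamma)/gamma * D*_0, with D*_0 = log|A| (|A| = 4 here).
etas = [(1.0 - gamma) / gamma * np.log(4)]
for _ in range(5):
    etas.append(etas[-1] / gamma)
assert all(b > a for a, b in zip(etas, etas[1:]))
```

The recursion converges to the floor eps/(1 - rate) = ϑ_ρ ε, which mirrors the geometric-series term in the final bound of Theorem 5.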

D.3 PROOF OF THEOREM 2

Similar to Theorem 5, to prove Theorem 2, it suffices to prove the following sublinear convergence of Q-NPG with any initial policy.

Theorem 6. Fix a state distribution ρ, a state-action distribution ν and an optimal policy π*. We consider the Q-NPG method (18) with any constant step size η_k = η > 0. Suppose that Assumptions 1, 2, 3 and 4 all hold. Then we have for all k ≥ 0,

(1/k) Σ_{t=0}^{k-1} E[V_ρ(π^{(t)}) - V_ρ(π*)] ≤ (1/((1-γ)k)) (D*_0/η + 2ϑ_ρ) + (2√|A| (ϑ_ρ √C_ρ + 1)/(1-γ)) (√(κ_ν ε_stat/(1-γ)) + √ε_bias).

Remark. A deviation from the setting of Theorem 1 is that here we require π* to be an optimal policy (see Footnote 4). Compared to Theorem 20 in Agarwal et al. (2021), our convergence rate is also sublinear, but improved to O(1/k), as opposed to O(1/√k). Moreover, they use a diminishing step size of order O(1/√k), while our constant step size is unconstrained.

Proof. By (73) and using a constant step size η, we have

ϑ_ρ (δ_{k+1} - δ_k) + δ_k ≤ D*_k/((1-γ)η) - D*_{k+1}/((1-γ)η) + (2√|A| (ϑ_ρ √C_ρ + 1)/(1-γ)) (√(κ_ν ε^{(k)}_stat/(1-γ)) + √(ε^{(k)}_bias)).

Taking the total expectation with respect to the randomness in the sequence of iterates w^{(0)}, ..., w^{(k-1)}, summing up from 0 to k-1 and rearranging terms, we have

ϑ_ρ E[δ_k] + Σ_{t=0}^{k-1} E[δ_t] ≤ D*_0/((1-γ)η) + ϑ_ρ δ_0 + k · (2√|A| (ϑ_ρ √C_ρ + 1)/(1-γ)) (√(κ_ν ε_stat/(1-γ)) + √ε_bias),

where, by Jensen's inequality, we use

E[√(ε^{(t)}_stat)] ≤ √(E[ε^{(t)}_stat]) ≤ √ε_stat (by (63)),   E[√(ε^{(t)}_bias)] ≤ √(E[ε^{(t)}_bias]) ≤ √ε_bias (by (66)).

Finally, dropping the nonnegative term E[δ_k] on the left-hand side, as π* is an optimal policy, and dividing both sides by k yields

(1/k) Σ_{t=0}^{k-1} E[V_ρ(π^{(t)}) - V_ρ(π*)] ≤ D*_0/((1-γ)ηk) + 2ϑ_ρ/((1-γ)k) + (2√|A| (ϑ_ρ √C_ρ + 1)/(1-γ)) (√(κ_ν ε_stat/(1-γ)) + √ε_bias).

D.4 PROOF OF THEOREM 3

Similar to Theorem 5, to prove Theorem 3, it suffices to prove the following Q-NPG result with any initial policy.

Theorem 7. Fix a state distribution ρ, a state-action distribution ν and a comparator policy π*. We consider the Q-NPG method (18) with step sizes satisfying η_0 ≥ ((1-γ)/γ) D*_0 and η_{k+1} ≥ η_k/γ. Suppose that Assumptions 1, 5 and 6 hold. Then we have for all k ≥ 0,

E[V_ρ(π^{(k)})] - V_ρ(π*) ≤ (1 - 1/ϑ_ρ)^k · 2/(1-γ) + (2√C_ν (ϑ_ρ + 1)/(1-γ)) (√ε_stat + √ε_approx).

Proof. Similar to the proof of Theorem 1, by Lemma 8 we upper bound the absolute values of the terms |1|, |2|, |3|, |4|, |a|, |b|, |c|, |d| introduced in (49) separately, now under the set of assumptions of Theorem 3. To upper bound |1|, by the Cauchy-Schwarz inequality we get

|1| ≤ Σ_{s,a} d̃^{(k+1)}_s π^{(k+1)}_{s,a} |φ^T_{s,a}(w^{(k)} - w⋆^{(k)})|
   ≤ √(Σ_{s,a} (d̃^{(k+1)}_s π^{(k+1)}_{s,a})² / d̃^{(k)}_{s,a}) · √(Σ_{s,a} d̃^{(k)}_{s,a} (φ^T_{s,a}(w^{(k)} - w⋆^{(k)}))²)
   = √(E_{(s,a)∼d̃^{(k)}}[(d̃^{(k+1)}_s π^{(k+1)}_{s,a} / d̃^{(k)}_{s,a})²]) · ||w^{(k)} - w⋆^{(k)}||_{Σ_{d̃^{(k)}}}   (by (60))
   ≤ √C_ν ||w^{(k)} - w⋆^{(k)}||_{Σ_{d̃^{(k)}}}   (by (28)),

which together with (62) gives |1| ≤ √(C_ν ε^{(k)}_stat). Similar to |1|, by using Assumption 6 and the Cauchy-Schwarz inequality, and by simply replacing π^{(k+1)} with π^{(k)} or π* and replacing d̃^{(k+1)} with d̃*, we obtain the same upper bound for |3|, |a| and |c|, that is,

|3|, |a|, |c| ≤ √(C_ν ε^{(k)}_stat).

Next, we define ε^{(k)}_approx := L_Q(w⋆^{(k)}, θ^{(k)}, d̃^{(k)}). By Assumption 5, we know that E[ε^{(k)}_approx] ≤ ε_approx. To upper bound |2|, by the Cauchy-Schwarz inequality we have

|2| ≤ Σ_{s,a} d̃^{(k+1)}_s π^{(k+1)}_{s,a} |φ^T_{s,a} w⋆^{(k)} - Q^{(k)}_{s,a}|
   ≤ √(Σ_{s,a} (d̃^{(k+1)}_s π^{(k+1)}_{s,a})² / d̃^{(k)}_{s,a}) · √(Σ_{s,a} d̃^{(k)}_{s,a} (φ^T_{s,a} w⋆^{(k)} - Q^{(k)}_{s,a})²)
   = √(E_{(s,a)∼d̃^{(k)}}[(d̃^{(k+1)}_s π^{(k+1)}_{s,a} / d̃^{(k)}_{s,a})²] · ε^{(k)}_approx)
   ≤ √(C_ν ε^{(k)}_approx)   (by (28)).
Similar to | 2 |, by using Assumption 5 and Cauchy-Schwartz's inequality, and by simply replacing π (k+1) into π (k) or π * and replacing d (k+1) into d * , we obtain the same upper bound for | 4 |, | b | and | d |, that is | 4 |, | b |, | d | ≤ C ν (k) approx . Consequently, plugging all these upper bounds into (49) leads to the following recurrent inequality ϑ ρ (δ k+1 -δ k ) + δ k ≤ D * k (1 -γ)η k - D * k+1 (1 -γ)η k + 2 √ C ν (ϑ ρ + 1) 1 -γ (k) stat + (k) approx . By using the same increasing step size as in Theorem 5 and following the same arguments in the proof of Theorem 5 after (73), we obtain the final performance bound with the linear convergence rate E V ρ (π (k) ) -V ρ (π * ) ≤ 1 - 1 ϑ ρ k 2 1 -γ + 2 √ C ν (ϑ ρ + 1) 1 -γ √ stat + √ approx .

D.5 PROOF OF COROLLARY 1

In this section, we aim to prove Corollary 1 presented in Section 4.3.

Corollary 2 (Corollary 1). Consider the setting of Theorem 3. Suppose that the sample-based Q-NPG Algorithm 2 is run for K iterations, with T gradient steps of Q-NPG-SGD (Algorithm 6) per iteration. Furthermore, suppose that ||φ_{s,a}|| ≤ B with B > 0 for all (s, a) ∈ S × A, and we choose the step size α = 1/(2B²) and the initialization w_0 = 0 for Q-NPG-SGD. If the covariance matrix of the feature map under the initial state-action distribution ν satisfies

E_{(s,a)∼ν}[φ_{s,a} φ^T_{s,a}] = Σ_ν ⪰ µ I_m   (by (23)),

where I_m ∈ R^{m×m} is the identity matrix and µ > 0, then

E[V_ρ(π^{(K)})] - V_ρ(π*) ≤ (1 - 1/ϑ_ρ)^K · 2/(1-γ) + 2(ϑ_ρ + 1) √(C_ν ε_approx)/(1-γ) + (4√C_ν (ϑ_ρ + 1)/((1-γ)³ √T)) (B²/µ · (√(2m) + 1) + (1-γ)√(2m)).

In order to better understand our proof, we first identify an issue that appeared in the sample complexity analysis of Q-NPG in Agarwal et al. (2021, Corollary 26). Agarwal et al. (2021) adopt the optimization results of Shalev-Shwartz & Ben-David (2014, Theorem 14.8), where the stochastic gradient ∇̂L_Q(w, θ, d̃_θ) in (47) needs to be bounded (see Footnote 5). However, although they consider a projection step for the iterate w_t and assume that the feature map φ_{s,a} is bounded, ∇̂L_Q(w, θ, d̃_θ) is still not guaranteed to be bounded. Indeed, recall the stochastic gradient of the function L_Q in (47):

∇̂_w L_Q(w, θ, d̃_θ) = 2 (φ^T_{s,a} w - Q̂_{s,a}(θ)) φ_{s,a}.

They incorrectly use the argument that w, φ_{s,a} and Q̂_{s,a}(θ) are bounded to imply that ∇̂_w L_Q(w, θ, d̃_θ) is bounded. In fact, Q̂_{s,a}(θ) can be unbounded even though E[Q̂_{s,a}(θ)] = Q_{s,a}(θ) ∈ [0, 1/(1-γ)] is bounded. To see this, we can rewrite Q̂_{s,a}(θ) from (43) as

Q̂_{s,a}(θ) = Σ_{t=0}^{H} c(s_t, a_t), with (s_0, a_0) = (s, a) ∼ d̃_θ,

where H is the length of the sampled trajectory for estimating Q_{s,a}(θ) in Algorithm 3. From Algorithm 3 and the proof of Lemma 4, we know that the probability that H = k + 1 is Pr(H = k + 1) = (1 - γ)γ^k.
So, H can be arbitrarily large, albeit with exponentially decreasing probability. Consequently, |Q̂_{s,a}(θ)|, which is only upper bounded by H, is not guaranteed to be bounded.

Proof sketch. Instead, we adopt the optimization results of Bach & Moulines (2013, Theorem 1) (see also Theorem 12), which do not require boundedness of the stochastic gradient. However, in our proof below, we verify that E[Q̂_{s,a}(θ)²] is bounded even though Q̂_{s,a}(θ) is unbounded. To verify condition (vi) of Theorem 12, i.e., that the covariance of the stochastic gradient at the optimum is upper bounded by the covariance of the feature map up to a finite constant, we use a conditional expectation argument to separate the correlated random variables (s, a) ∼ d̃_θ and Q̂_{s,a}(θ) that appear in the stochastic gradient.

Non-singularity of the covariance matrix Σ_ν. As for condition (31), it is shown in Cayci et al. (2021, Proposition 3) that with ν chosen as the uniform distribution over S × A and φ_{s,a} ∼ N(0, I_m) sampled as Gaussian random features, (31) is guaranteed with high probability. More generally, with m ≪ |S||A|, it is easy to find m linearly independent φ_{s,a} among all |S||A| features such that the covariance matrix Σ_ν has full rank.

Proof. From Theorem 3, it remains to upper bound the statistical error √ε_stat produced by the Q-NPG-SGD procedure (Algorithm 6) at each iteration k. We suppress the superscript (k). Let w_out be the output of T steps of Q-NPG-SGD with the constant step size 1/(2B²) and the initialization w_0 = 0, and let w⋆ ∈ argmin_w L_Q(w, θ, d̃_θ) be the exact minimizer. To upper bound ε_stat from (19), we apply the standard analysis of averaged SGD, i.e., Theorem 12, and now verify all of its conditions for Q-NPG-SGD. First, (i) is verified by considering the Euclidean space H = R^m. The observations (φ_{s,a}, Q̂_{s,a}(θ) φ_{s,a}) ∈ R^m × R^m are independent and identically distributed, sampled from Algorithm 3.
Thus, (ii) is verified with x n = φ s,a ∈ R m and z n = Q s,a (θ)φ s,a ∈ R m . As the feature map φ s,a ≤ B, we have E φ s,a 2 finite. From (31), we know that the covariance E φ s,a φ s,a is invertible. To verify (iii), it remains to verify that E Q s,a (θ)φ s,a 2 is finite. Indeed, by using φ s,a ≤ B, we have E Q s,a (θ)φ s,a 2 ≤ B 2 E Q s,a (θ) 2 . Thus, it remains to show E Q s,a (θ) 2 finite for (iii). From (43), we rewrite Q s,a (θ) as Q s,a (θ) = H t=0 c(s t , a t ), with (s 0 , a 0 ) = (s, a) ∼ d θ and H is the length of the trajectory for estimating Q s,a (θ). Thus, (iii) is verified as the variance of Q s,a (θ) is upper bounded by E Q s,a (θ) 2 = E (s,a)∼ d θ   ∞ k=0 Pr(H = k)E   k t=0 c(s t , a t ) 2 | H = k, s 0 = s, a 0 = a     = E (s,a)∼ d θ   (1 -γ) ∞ k=0 γ k E   k t=0 c(s t , a t ) 2 | H = k, s 0 = s, a 0 = a     ≤ E (s,a)∼ d θ (1 -γ) ∞ k=0 γ k (k + 1) 2 ≤ 2 (1 -γ) 2 , ( ) where the first inequality is obtained as |c(s t , a t )| ∈ [0, 1] for all (s t , a t ) ∈ S × A. Next, we introduce the residual ξ def = Q s,a (θ) -w φ s,a φ s,a = 1 2 ∇ w L Q (w , θ, d θ ). From Lemma 7, we know that E ∇ w L Q (w , θ, d θ ) = ∇ w L Q (w , θ, d θ ). So, we have that E [ξ] = 1 2 ∇ w L Q (w , θ, d θ ) = 0, where the last equality is obtained as w is the exact minimizer of the loss function L Q . Thus, (iv) is verified with that f is 1 2 L Q , ξ n is ξ and θ is w in our context. From Q-NPG-SGD update 47, we have (v) verified with step size α/2 in our context. Finally, for (vi), from the boundedness of the feature map φ s,a ≤ B, we take R = B such that E φ s,a 2 φ s,a φ s,a ≤ B 2 E φ s,a φ s,a . It remains to find σ > 0 such that E ξξ ≤ σ 2 E φ s,a φ s,a . We rewrite the covariance of ξ as E ξξ (75) = E Q s,a (θ) -w φ s,a 2 φ s,a φ s,a = E (s,a)∼ d θ Q s,a (θ) -w φ s,a 2 φ s,a φ s,a | s, a = E (s,a)∼ d θ E Q s,a (θ) -w φ s,a 2 | s, a φ s,a φ s,a . 
Thus, it suffices to find σ > 0 such that E Q s,a (θ) -w φ s,a 2 | s, a = E Q s,a (θ) 2 | s, a -2Q s,a (θ)w φ s,a + w φ s,a 2 ≤ σ 2 (76) for all (s, a) ∈ S × A to verify (vi). Besides, we know that E Q s,a (θ) 2 | s, a (74) ≤ 2 (1 -γ) 2 . We also know that |Q s,a (θ)| ≤ 1 1-γ and φ s,a ≤ B. Now we need to bound w . Again, since w is the exact minimizer, we have ∇ w L Q (w , θ, d θ ) = 0. That is E (s,a)∼ d θ w φ s,a -Q s,a (θ) φ s,a = 0, which implies w = E (s,a)∼ d θ φ s,a φ s,a † E (s,a)∼ d θ [Q s,a (θ)φ s,a ] (5) ≤ 1 1 -γ E (s,a)∼ν φ s,a φ s,a † E (s,a)∼ d θ [Q s,a (θ)φ s,a ] . By the boundness of the feature map φ s,a ≤ B and the Q-function |Q s,a (θ)| ≤ 1 1-γ , and the condition (31), we have the minimizer w bounded by w (31) ≤ B µ(1 -γ) 2 . By using the upper bounds of E Q s,a (θ) 2 | s, a , |Q s,a (θ)|, w and φ s,a , the left hand side of (76) can be upper bounded by E Q s,a (θ) -w φ s,a 2 | s, a ≤ 2 (1 -γ) 2 + 2B 2 µ(1 -γ) 3 + B 4 µ 2 (1 -γ) 4 = 1 (1 -γ) 2 B 2 µ(1 -γ) + 1 2 + 1 ≤ 2 (1 -γ) 2 B 2 µ(1 -γ) + 1 2 . Thus, in order to satisfy (76), we choose σ = √ 2 1 -γ B 2 µ(1 -γ) + 1 . Now all the conditions (i) -(vi) in Theorem 12 are verified. With step size α = 1 2B 2 , the initialization w 0 = 0 and T steps of Q-NPG-SGD updates (47), we have E L Q (w out , θ, d θ ) -L Q (w , θ, d θ ) ≤ 4 T σ √ m + B w 2 ≤ 4 T √ 2m 1 -γ B 2 µ(1 -γ) + 1 + B 2 µ(1 -γ) 2 2 . Consequently, Assumption 1 is verified by √ stat ≤ 2 (1 -γ) √ T B 2 µ(1 -γ) √ 2m + 1 + √ 2m . The proof is completed by replacing the above upper bound of √ stat in the results of Theorem 3.
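The key point above, that the sampled trajectory length is geometric, so Q̂_{s,a}(θ) is unbounded while its second moment stays below 2/(1-γ)², can be checked by simulation. A sketch (numpy Monte Carlo with a hypothetical seed; N denotes the number of cost terms in the estimate, so |Q̂| ≤ N when |c| ≤ 1):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
n = 200_000

# N = number of cost terms in Q_hat = sum_t c(s_t, a_t), drawn with
# Pr(N = k + 1) = (1 - gamma) * gamma^k, i.e. N ~ Geometric(1 - gamma).
N = rng.geometric(1.0 - gamma, size=n).astype(float)

# N is unbounded: its sample maximum far exceeds the mean 1 / (1 - gamma),
# so |Q_hat| <= N admits no deterministic bound.
assert N.max() > 1.0 / (1.0 - gamma)

# ... yet the second moment is finite and matches the bound in (74):
# E[N^2] = (1 + gamma) / (1 - gamma)^2 <= 2 / (1 - gamma)^2.
exact = (1.0 + gamma) / (1.0 - gamma) ** 2
assert (N ** 2).mean() <= 2.0 / (1.0 - gamma) ** 2
assert abs((N ** 2).mean() - exact) / exact < 0.05
```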

E PROOF OF SECTION 5 E.1 THE ONE STEP NPG LEMMA

To prove Theorem 4 and the sublinear convergence result of Theorem 9 further in Appendix F, we start from providing the one step analysis of the NPG update. Lemma 9 (One step NPG lemma). Fix a state distribution ρ; an initial state-action distribution ν; an arbitrary comparator policy π * . At the k-th iteration, let w (k) ∈ argmin w L A (w, θ (k) , d (k) ) denote the exact minimizer. Consider the w (k) and π (k) NPG iterates given in (32) and (17) respectively. Note (k) stat def = L A (w (k) , θ (k) , d (k) ) -L A (w (k) , θ (k) , d (k) ), ( ) (k) approx def = L A (w (k) , θ (k) , d (k) ), δ k def = V (k) ρ -V ρ (π * ). If Assumptions 7, 8 and 9 hold for all k ≥ 0, then we have that ϑ ρ (δ k+1 -δ k ) + δ k ≤ D * k (1 -γ)η k - D * k+1 (1 -γ)η k + √ C ν (ϑ ρ + 1) 1 -γ (k) stat + (k) approx . (79) Proof. From the three-point descent lemma (Lemma 11) and ( 17), we obtain that for any p ∈ ∆(A), we have η k Φ(k) s w (k) , π (k+1) s + D(π (k+1) s , π (k) s ) ≤ η k Φ(k) s w (k) , p + D(p, π (k) s ) -D(p, π ). Rearranging terms and dividing both sides by η k , we get Φ(k) s w (k) , π (k+1) s -p + 1 η k D(π (k+1) s , π (k) s ) ≤ 1 η k D(p, π (k) s ) - 1 η k D(p, π . Letting p = π (k) s and knowing that Φ(k) s w (k) , π (k) s = 0 for all k ≥ 0, which is due to (12), we have Φ(k) s w (k) , π (k+1) s ≤ - 1 η k D(π (k+1) s , π (k) s ) - 1 η k D(π (k) s , π (k+1) s ) ≤ 0. ( ) Letting p = π * s yields Φ(k) s w (k) , π (k+1) s -π * s ≤ 1 η k D(π * s , π (k) s ) - 1 η k D(π * s , π ). Note that we dropped the nonnegative term 1 η k D(π (k+1) s , π s ) on the left hand side to the inequality. Taking expectation with respect to the distribution d * , we have E s∼d * Φ(k) s w (k) , π (k+1) s -E s∼d * Φ(k) s w (k) , π * s ≤ 1 η k D * k - 1 η k D * k+1 . 
For the first expectation in (81), we have E s∼d * Φ(k) s w (k) , π (k+1) s = s∈S d * s Φ(k) s w (k) , π (k+1) s = s∈S d * s d (k+1) s d (k+1) s Φ(k) s w (k) , π (k+1) s (20)+(80) ≥ ϑ k+1 s∈S d (k+1) s Φ(k) s w (k) , π (k+1) s (20)+(80) ≥ ϑ ρ s∈S d (k+1) s Φ(k) s w (k) , π (k+1) s = ϑ ρ E (s,a)∼ d (k+1) ( φ(k) s,a ) w (k) = ϑ ρ E (s,a)∼ d (k+1) A (k) s,a + ϑ ρ E (s,a)∼ d (k+1) ( φ(k) s,a ) w (k) -A (k) s,a = ϑ ρ (1 -γ) V (k+1) ρ -V (k) ρ + ϑ ρ E (s,a)∼ d (k+1) ( φ(k) s,a ) w (k) -A (k) s,a , where the last line is obtained by the performance difference lemma (41), and we use the shorthand φ(k) s,a as φs,a (θ (k) ). The second term of (82) can be lower bounded. To do it, we first decompose it into two terms. That is, To upper bound 1 , we first define the following covariance matrix of the centered feature map E (s,a)∼ d (k+1) ( φ(k) s,a ) w (k) -A (k) s,a = E (s,a)∼ d (k+1) ( φ(k) s,a ) (w (k) -w (k) ) 1 + E (s,a)∼ d (k+1) ( φ(k) s,a ) w (k) -A (k) s,a 2 . ( Σ (k) d (k) def = E (s,a)∼ d (k) φ (k) s,a ( φ (k) s,a ) . ( ) Here we use the superscript (k) for Σ (k) d (k) to distinguish the covariance matrix of the feature map Σ d (k) defined in (60) in the proof of Theorem 1, as the centered feature map φ (k) s,a depends on the iterates θ (k) . By Cauchy-Schwartz's inequality, we have 1 ≤ (s,a)∈S×A d (k+1) s,a ( φ(k) s,a ) (w (k) -w (k) ) ≤ (s,a)∈S×A d (k+1) s,a 2 d (k) s,a (s,a)∈S×A d (k) s,a ( φ(k) s,a ) (w (k) -w (k) ) 2 (84) = E (s,a)∼ d (k)   d (k+1) s,a d (k) s,a 2   w (k) -w (k) 2 Σ (k) d (k) . By further using the concentrability assumption 9, we have 1 (35) ≤ C ν w (k) -w (k) 2 Σ (k) d (k) ≤ C ν L A (w (k) , θ (k) , d (k) ) -L A (w (k) , θ (k) , d (k) ) ( ) (77) = C ν (k) stat , where ( 85) uses that w (k) is a minimizer of L A and w (k) is feasible (see the same arguments of (62) in the proof of Theorem 1).  d (k) s,a ( φ(k) s,a ) w (k) -A (k) s,a 2 = E (s,a)∼ d (k)   d (k+1) s,a d (k) s,a 2   L A (w (k) , θ (k) , d (k) ) (35)+(78) ≤ C ν (k) approx . 
Plugging ( 86) and ( 87) into ( 82) yields E s∼d * Φ(k) s w (k) , π (k+1) s ≥ ϑ ρ (1 -γ) V (k+1) ρ -V (k) ρ -ϑ ρ C ν (k) stat + (k) approx . Now for the second expectation in (81), by using the performance difference lemma (41) in Lemma 3, we have -E s∼d * Φ(k) s w (k) , π * s = -E (s,a)∼ d π * A (k) s,a + E (s,a)∼ d π * A (k) s,a -( φ(k) s,a ) w (k) = (1 -γ) V (k) ρ -V ρ (π * ) + E (s,a)∼ d π * A (k) s,a -( φ(k) s,a ) w (k) . The second term of (89) can be lower bounded. We first decompose it into two terms. That is, For the first one | a |, by Cauchy-Schwartz's inequality, we have E (s,a)∼ d π * A (k) s,a -( φ(k) s,a ) w (k) = E (s,a)∼ d π * A (k) s,a -( φ(k) s,a ) w (k) a + E (s,a)∼ d π * ( φ(k) s,a ) (w (k) -w (k) ) b . ( | a | ≤ (s,a)∈S×A d π * s,a A (k) s,a -( φ(k) s,a ) w (k) ≤ (s,a)∈S×A d π * s,a 2 d (k) s,a (s,a)∈S×A d (k) s,a ( φ(k) s,a ) w (k) -A (k) s,a 2 = E (s,a)∼ d (k)   d π * s,a d (k) s,a 2   L A (w (k) , θ (k) , d (k) ) (35)+(78) ≤ C ν (k) approx . For the second term | b | in (90), by Cauchy-Schwartz's inequality, we have | b | ≤ (s,a)∈S×A d π * s,a ( φ(k) s,a ) (w (k) -w (k) ) ≤ (s,a)∈S×A d π * s,a 2 d (k) s,a (s,a)∈S×A d (k) s,a ( φ(k) s,a ) (w (k) -w (k) ) 2 (84) = E (s,a)∼ d (k)   d π * s,a d (k) s,a 2   w (k) -w (k) 2 Σ (k) d (k) (35) ≤ C ν w (k) -w (k) 2 Σ (k) d (k) (85) ≤ C ν L A (w (k) , θ (k) , d (k) ) -L A (w (k) , θ (k) , d (k) ) (77) = C ν (k) stat . Thus, we lower bound (90) by -E s∼d * Φ(k) s w (k) , π * s (91)+(92) ≥ (1 -γ) V (k) ρ -V ρ (π * ) -C ν (k) stat + (k) approx . Substituting ( 88) and ( 93) into (81), dividing both side by 1 -γ and rearranging terms, we get ϑ ρ (δ k+1 -δ k ) + δ k ≤ D * k (1 -γ)η k - D * k+1 (1 -γ)η k + √ C ν (ϑ ρ + 1) 1 -γ (k) stat + (k) approx .

E.2 PROOF OF THEOREM 4

To prove Theorem 4, we prove the following general result with any initial policy by knowing that D * 0 ≤ log |A| when using the uniform initial policy in (48). Theorem 8. Fix a state distribution ρ, a state-action distribution ν, and a comparator policy π * . We consider the NPG method (32) with the step sizes satisfying η 0 ≥ 1-γ γ D * 0 and η k+1 ≥ 1 γ η k . Suppose that Assumptions 7, 8 and 9 hold. Then we have for all k ≥ 0, E V ρ (π (k) ) -V ρ (π * ) ≤ 1 - 1 ϑ ρ k 2 1 -γ + √ C ν (ϑ ρ + 1) 1 -γ √ stat + √ approx . Proof. From (79) in Lemma 9, by using the same increasing step size as in Theorem 5, i.e. η 0 ≥ 1-γ γ D * 0 and η k+1 ≥ η k /γ, and following the same arguments in the proof of Theorem 5 after (73), we obtain the final performance bound with the linear convergence rate E V ρ (π (k) ) -V ρ (π * ) ≤ 1 - 1 ϑ ρ k 2 1 -γ + √ C ν (ϑ ρ + 1) 1 -γ √ stat + √ approx .

F SUBLINEAR CONVERGENCE OF NPG WITH CONSTANT STEP SIZE

In this section, we provide the sublinear convergence of NPG with an arbitrary constant step size.

Theorem 9. Fix a state distribution ρ, a state-action distribution ν and an optimal policy π*. We consider the NPG method (32) with any constant step size η_k = η > 0. Suppose that Assumptions 7, 8 and 9 hold. Then we have for all k ≥ 0,

(1/k) Σ_{t=0}^{k-1} E[V_ρ(π^{(t)}) - V_ρ(π*)] ≤ (1/((1-γ)k)) (D*_0/η + 2ϑ_ρ) + (√C_ν (ϑ_ρ + 1)/(1-γ)) (√ε_stat + √ε_approx).

Compared to Theorem 2, again here we require π* to be an optimal policy, for the same reason as indicated in Footnote 5. Furthermore, our sublinear convergence guarantees for Q-NPG and NPG are the same. Compared to Theorem 29 in Agarwal et al. (2021), the main differences are also similar to those for Q-NPG summarized right after Theorem 2: our convergence rate improves from O(1/√k) to O(1/k), and they use a diminishing step size of order O(1/√k) while we can take any constant step size. Despite the difference of using d^{(k)} instead of d̃^{(k)} for the compatible function approximation L_A(w^{(k)}, θ^{(k)}, d^{(k)}), notice that the same sublinear convergence rate O(1/k) is established by Liu et al. (2020) for NPG with a constant step size; however, their step size is bounded by the inverse of a smoothness constant, and they further require that the feature map is bounded and that the Fisher information matrix (9) is strictly lower bounded for all parameters θ ∈ R^m (see this condition later in (94)). With such additional conditions, we are able to provide an Õ(1/((1-γ)^5 ε²)) sample complexity result for NPG in Appendix G.

Proof. From (79) in Lemma 9 with the constant step size, we have

ϑ_ρ (δ_{k+1} - δ_k) + δ_k ≤ D*_k/((1-γ)η) - D*_{k+1}/((1-γ)η) + (√C_ν (ϑ_ρ + 1)/(1-γ)) (√(ε^{(k)}_stat) + √(ε^{(k)}_approx)).
Taking the total expectation with respect to the randomness in the sequence of the iterates w (0) , • • • , w (k-1) yields ϑ ρ (E [δ k+1 ] -E [δ k ]) + E [δ k ] ≤ E [D * k ] (1 -γ)η - E D * k+1 (1 -γ)η + √ C ν (ϑ ρ + 1) 1 -γ E (k) stat + E (k) approx ≤ E [D * k ] (1 -γ)η - E D * k+1 (1 -γ)η + √ C ν (ϑ ρ + 1) 1 -γ E (k) stat + E (k) approx (33)+(34) ≤ E [D * k ] (1 -γ)η - E D * k+1 (1 -γ)η + √ C ν (ϑ ρ + 1) 1 -γ √ stat + √ approx . By summing up from 0 to k -1, we get ϑ ρ E [δ k ] + k-1 t=0 E [δ t ] ≤ D * 0 (1 -γ)η + ϑ ρ δ 0 + k • √ C ν (ϑ ρ + 1) 1 -γ √ stat + √ approx . Finally, dropping the positive term E [δ k ] on the left hand side as π * is the optimal policy and dividing both side by k yields 1 k k-1 t=0 E V ρ (π (t) ) -V ρ (π * ) ≤ D * 0 (1 -γ)ηk + 2ϑ ρ (1 -γ)k + √ C ν (ϑ ρ + 1) 1 -γ √ stat + √ approx .

G SAMPLE COMPLEXITY OF NPG

Equipped with the regression solver NPG-SGD (Algorithm 5), a slight modification of Q-NPG-SGD that yields unbiased gradient estimates of L_A, we consider the sample-based NPG Algorithm 1 proposed in Appendix C and show its sample complexity in the following corollary.

Corollary 3. Consider the setting of Theorem 8. Suppose that the sample-based NPG Algorithm 1 is run for K iterations, with T gradient steps of NPG-SGD (Algorithm 5) per iteration. Furthermore, suppose that ||φ_{s,a}|| ≤ B with B > 0 for all (s, a) ∈ S × A, and we choose the step size α = 1/(8B²) and the initialization w_0 = 0 for NPG-SGD. If for all θ ∈ R^m, the covariance matrix of the centered feature map induced by the policy π(θ) and the initial state-action distribution ν satisfies

E_{(s,a)∼d̃_θ}[φ̃_{s,a}(θ) φ̃_{s,a}(θ)^T] ⪰ µ I_m,   (94)

where I_m ∈ R^{m×m} is the identity matrix and µ > 0, then

E[V_ρ(π^{(K)})] - V_ρ(π*) ≤ (1 - 1/ϑ_ρ)^K · 2/(1-γ) + (ϑ_ρ + 1) √(C_ν ε_approx)/(1-γ) + (4√C_ν (ϑ_ρ + 1)/((1-γ)² √T)) (2B²/µ · (√(2m) + 1) + √(2m)).

(Footnote: ... 1/((1-γ)^5 ε²); they consider a projection step for the iterates and incorrectly bound the stochastic gradient due to a similar error indicated in Footnote 3 (and see Appendix G.1 for more details), while we assume Fisher-non-degeneracy (94).)

Compared to Corollary 1, the sample complexities for Q-NPG and NPG are the same. The assumption (94) on the Fisher information matrix is much stronger than (31), as (31) is independent of the iterates. However, despite the difference of using ν instead of ρ, the Fisher-non-degeneracy (94) is commonly used in the optimization literature (Byrd et al., 2016; Gower et al., 2016; Wang et al., 2017) and in the RL literature (Liu et al., 2020; Ding et al., 2022; Yuan et al., 2022). It guarantees that the Fisher information matrix behaves well as a preconditioner in the NPG update (8).
Indeed, (94) is directly assumed to be positive definite in the pioneering NPG work (Kakade, 2001) and in the follow-up works on natural actor-critic algorithms (Peters & Schaal, 2008; Bhatnagar et al., 2009). It is satisfied by a wide family of policies, including the Gaussian policy (Duan et al., 2016; Papini et al., 2018; Huang et al., 2020) and certain neural policies with the log-linear policy as a special case.
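To make condition (94) concrete: for a log-linear policy π_θ(a|s) ∝ exp(φ_{s,a}^T θ), the Fisher information matrix is the second moment of the centered features φ̃_{s,a}(θ) = φ_{s,a} - E_{a'∼π_θ(·|s)}[φ_{s,a'}]. The following minimal sketch (random Gaussian features and a uniform state distribution, both purely illustrative) assembles F(θ) and inspects its smallest eigenvalue; whether a uniform lower bound µ > 0 actually holds depends on the features and the sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, m = 6, 4, 3
phi = rng.standard_normal((S, A, m))     # feature map phi_{s,a} (illustrative)
theta = rng.standard_normal(m)

def policy(theta):
    """Log-linear policy pi_theta(a|s) ∝ exp(phi_{s,a}^T theta)."""
    logits = phi @ theta                 # shape (S, A)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

pi = policy(theta)
nu_s = np.full(S, 1.0 / S)               # state marginal (uniform, illustrative)

# Fisher matrix F(theta) = E_{s~nu, a~pi}[phi_tilde phi_tilde^T]
# with centered features phi_tilde = phi_{s,a} - E_{a'~pi(.|s)} phi_{s,a'}.
F = np.zeros((m, m))
for s in range(S):
    mean_phi = pi[s] @ phi[s]            # E_{a ~ pi(.|s)} phi_{s,a}
    centered = phi[s] - mean_phi
    F += nu_s[s] * (centered * pi[s][:, None]).T @ centered

eigvals = np.linalg.eigvalsh(F)
assert eigvals.min() >= -1e-10           # F is positive semidefinite by construction
mu_candidate = eigvals.min()             # candidate lower bound in (94) at this theta
```

Condition (94) asks for a single µ > 0 valid for all θ, which is strictly stronger than positive semidefiniteness at one parameter.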

G.1 PROOF OF COROLLARY 3

There is a remark for the proof of Corollary 3 similar to the one right before the proof of Corollary 1 in Appendix D.5. We notice that the same error occurs in the NPG sample complexity analysis of Agarwal et al. (2021). Recall the stochastic gradient of L_A in (44):

∇̂_w L_A(w, θ, d̃_θ) = 2 (w⊤φ̄_{s,a}(θ) − Â_{s,a}(θ)) φ̄_{s,a}(θ).

It turns out that ∇̂_w L_A(w, θ, d̃_θ) is unbounded, since the estimate Â_{s,a}(θ) of A_{s,a}(θ) can be unbounded due to the unbounded length of the trajectory sampled in the sampling procedure, Algorithm 4. Thus, Agarwal et al. (2021) incorrectly verify that ∇̂_w L_A(w, θ, d̃_θ) is bounded by claiming that Â_{s,a}(θ) is bounded by 2/(1−γ).

Proof sketch. Despite the difference of using either d_θ or d̃_θ in the loss function L_A, we use the same assumptions as Liu et al. (2020), i.e., the Fisher-non-degeneracy (94) and the boundedness of the feature map, and verify all the conditions of Theorem 12 without relying on the boundedness of the stochastic gradient. In particular, similar to the proof of Corollary 1, we verify that E[Â_{s,a}(θ)²] is bounded even though Â_{s,a}(θ) is unbounded. To verify condition (vi) of Theorem 12 in our proof, we use the same conditional expectation argument as in the proof of Corollary 1 to separate the correlated random variables Â_{s,a}(θ) and φ̄_{s,a}(θ) with (s, a) ∼ d̃_θ appearing in the stochastic gradient.

Here the distribution mismatch coefficient ϑ_ρ and the concentrability coefficients C_ρ and C_ν are potentially large in our convergence theories. We discuss each of them in turn. We also provide further noteworthy observations about the connection between the use of the increasing step size and policy iteration.

Distribution mismatch coefficient ϑ_ρ. Our distribution mismatch coefficient ϑ_ρ in (20) is the same as the one in Xiao (2022). It admits both an upper bound and a lower bound.
The linear convergence rate in our theories is 1 − 1/ϑ_ρ > 0. Thus, the smaller ϑ_ρ is, the faster the resulting linear convergence rate. The best linear convergence rate is achieved when ϑ_ρ attains its lower bound. Our analysis is general in that it covers the distribution mismatch coefficient ϑ_ρ induced by any target state distribution ρ. Our results generalize, and sometimes also improve upon, prior results. A very pessimistic and trivial upper bound on ϑ_ρ is

ϑ_ρ ≤ 1 / ((1−γ) ρ_min).

However, if the target state distribution ρ ∈ ∆(S) does not have full support, i.e., ρ_s = 0 for some s ∈ S, then ϑ_ρ might be infinite according to this upper bound. Xiao (2022) simply assumes that ϑ_ρ is finite. We further propose a solution to this particular issue. Indeed, if ρ does not have full support, let π* be an optimal policy. We can always convert the convergence guarantees for some state distribution ρ′ ∈ ∆(S) with full support, i.e., ρ′_s > 0 for all s ∈ S, as follows:

V_ρ(π^{(k)}) − V_ρ(π*) = Σ_{s∈S} ρ_s (V_s(π^{(k)}) − V_s(π*)) = Σ_{s∈S} (ρ_s/ρ′_s) ρ′_s (V_s(π^{(k)}) − V_s(π*)) ≤ ‖ρ/ρ′‖_∞ Σ_{s∈S} ρ′_s (V_s(π^{(k)}) − V_s(π*)) = ‖ρ/ρ′‖_∞ (V_{ρ′}(π^{(k)}) − V_{ρ′}(π*)).

Then we only need convergence guarantees for V_{ρ′}(π^{(k)}) − V_{ρ′}(π*), which follow from all our convergence analyses above for an arbitrary ρ′. In this case, the linear convergence rate depends on

ϑ_{ρ′} := (1/(1−γ)) ‖d_{π*}(ρ′)/ρ′‖_∞ < ∞.

Equation (20) provides the lower bound 1/(1−γ) for ϑ_ρ. This lower bound is achieved when the target state distribution ρ satisfies ρ = d_{π*}(ρ), where π* is an optimal policy. The advantage of this case is that not only does it imply the best linear convergence rate, but more importantly, the fast linear convergence rate is known to be exactly γ, since 1 − 1/ϑ_ρ = 1 − (1−γ) = γ. So we know the convergence rate explicitly without any estimation, even though the optimal policy and the policy iterates are unknown before training. Hence, we know when to stop running the algorithm.
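The change-of-distribution step above is a plain Hölder-type bound and can be checked numerically; the distributions and the per-state gaps below are randomly generated illustrative data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
rho = rng.dirichlet(np.ones(n))             # target state distribution rho
rho_p = rng.dirichlet(np.ones(n)) + 1e-2    # full-support rho'
rho_p /= rho_p.sum()
gap = rng.uniform(size=n)                   # per-state gaps V_s(pi) - V_s(pi*) >= 0

lhs = float(rho @ gap)                              # V_rho(pi) - V_rho(pi*)
rhs = float(np.max(rho / rho_p) * (rho_p @ gap))    # ||rho/rho'||_inf * (V_rho'(pi) - V_rho'(pi*))
```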
Lan (2022) only considers the case ρ = d_{π*}(ρ), and we are able to recover the same linear convergence rate γ as in their result. Furthermore, the convergence performance V_ρ(π^{(k)}) − V_ρ(π*) depends on the target state distribution ρ. If the optimal policy π* is independent of the target state distribution ρ, which is usually the case in RL problems, then we are always allowed to fix ρ = d_{π*}(ρ) for the analysis, without knowing ρ and π*, and derive this best linear convergence performance with rate γ, because we use the initial state-action distribution ν in training, which is independent of ρ. Finally, from (20), if d^{(k)} converges to d*, then ϑ_k converges to 1. This might imply superlinear convergence results, as in Section 4.3 of Xiao (2022). In that case, the distribution mismatch coefficient ϑ_ρ no longer plays a role in the superlinear convergence analysis.

Concentrability coefficients C_ν. The issue of having (potentially large) concentrability coefficients is unavoidable in all the fast linear convergence analyses of inexact NPG that we are aware of, including the tabular setting (e.g., Lan (2022) and Xiao (2022)) and the log-linear policy setting (Cayci et al. (2021), Chen & Theja Maguluri (2022), and ours). First, in the fast linear convergence analysis of inexact NPG, the concentrability coefficients arise from the errors, namely the statistical error and the approximation error. Thus, one way to avoid the concentrability coefficients is to consider exact NPG in the tabular setting (see Theorem 10 in Xiao (2022)), because the tabular setting incurs no approximation error and exact NPG incurs no statistical error. We consider inexact NPG with the log-linear policy class. Consequently, the concentrability coefficients multiply both the statistical error ε_stat and the approximation error (ε_bias in Assumption 2, or ε_approx in Assumptions 5 and 8).
To remove the concentrability coefficients, one has to make strong assumptions on the errors in the L∞ supremum norm. In the tabular setting, Lan (2022) and Xiao (2022) assume that ‖Q̂(π) − Q(π)‖_∞ ≤ ε_stat. The downside of such a strong assumption is that it requires high sample complexity, as already explained in Appendix A.1. In the log-linear policy setting, Chen & Theja Maguluri (2022) assume that ‖Q_s(θ^{(k)}) − Φw^{(k)}‖_∞ ≤ ε_bias for the approximation error, which is a very strong assumption in the function approximation regime. Due to the supremum norm, ε_bias is unlikely to be small, especially for large action spaces. Under this strong assumption, Lan (2022), Xiao (2022) and Chen & Theja Maguluri (2022) are able to eliminate the concentrability coefficients. To avoid such strong assumptions, Cayci et al. (2021) and our paper consider the expected L2 errors in the log-linear policy setting, which are much weaker and, especially for the approximation error ε_bias, much more reasonable than the assumption of Chen & Theja Maguluri (2022). The tradeoff is that the concentrability coefficients cannot be eliminated in this case, both in Cayci et al. (2021) and in our results. Furthermore, as mentioned right after Theorem 4, under the expected error assumptions (Assumptions 7 and 8), our concentrability coefficient C_ν is better posed than the one in Assumption 2 of Cayci et al. (2021), in the sense that it is independent of the policies throughout the iterations, thanks to the use of d̃^{(k)} instead of d^{(k)} (as also mentioned in Remark 1), and it can be controlled to be finite through ν, while the one in Cayci et al. (2021) depends on the iterates, and is thus unknown and not guaranteed to be finite. Finally, like the distribution mismatch coefficient, the upper bound on C_ν in (30) is very pessimistic. By the definition of C_ν in (28), one can expect C_ν to be close to 1 when π^{(k)} and π^{(k+1)} converge to the optimal policy π*.
So our concentrability coefficient C_ν is the "best" one among all concentrability coefficients in the sense that it requires the weakest assumptions on the errors compared to Lan (2022), Xiao (2022) and Chen & Theja Maguluri (2022), it does not impose any restrictions on the MDP dynamics compared to Cayci et al. (2021), and it can be controlled to be finite through ν when other concentrability coefficients are infinite (Scherrer, 2014). It remains an open question whether one can obtain fast linear convergence results for inexact NPG in the log-linear policy setting with a small error floor and a much improved concentrability coefficient, e.g., of the same magnitude as the one in Agarwal et al. (2021).

Increasing step size and the connection with policy iteration. As for the use of the increasing step size, the intuition is that both NPG in (17) and Q-NPG in (16) behave more and more like policy iteration. For instance, when η_k → ∞ and we replace the linear approximation Φ_s w^{(k)} by Q_s(θ^{(k)}), (16) becomes

π_s^{(k+1)} ∈ argmin_{p ∈ ∆(A)} ⟨Q_s(θ^{(k)}), p⟩, ∀s ∈ S,

which is exactly the classical policy iteration method (e.g., Puterman, 1994; Bertsekas, 2012). We refer to Xiao (2022, Section 4.4) for more discussion on the connection with policy iteration.
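The η_k → ∞ limit described above can be sketched as exact policy iteration on a small synthetic MDP. The 3-state, 2-action instance and the cost-minimization convention (greedy step that puts all mass on an action minimizing ⟨Q_s, p⟩) follow the discussion above, while all numeric values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
c = rng.uniform(size=(S, A))                 # per-step costs (minimization convention)

def q_values(pi):
    """Policy evaluation: solve (I - gamma P_pi) V = c_pi, then Q = c + gamma P V."""
    P_pi = np.einsum('sa,san->sn', pi, P)
    c_pi = (pi * c).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    return c + gamma * P @ V

pi = np.full((S, A), 1.0 / A)                # uniform initial policy
for _ in range(50):
    Q = q_values(pi)
    greedy = np.eye(A)[Q.argmin(axis=1)]     # all mass on a minimizing action
    if np.allclose(greedy, pi):
        break                                # fixed point of policy iteration
    pi = greedy
```

The loop reaches a policy that is greedy with respect to its own Q-function, i.e., a fixed point of the update, after finitely many iterations, and each iteration only improves (here: lowers) the value.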

I STANDARD OPTIMIZATION RESULTS

In this section, we present the standard optimization results from Beck (2017), Xiao (2022) and Bach & Moulines (2013) that are used in our proofs. First, we present the closed-form update of mirror descent with KL divergence on the simplex. We provide its proof for completeness.
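As a numeric illustration of this closed-form update (stated as Lemma 10 below), here is a minimal sketch of one KL mirror-descent step on the simplex; the vectors q, g and the step size are arbitrary illustrative choices.

```python
import numpy as np

def kl_md_step(q, g, eta):
    """Closed-form solution of  min_{p in simplex} eta*<g, p> + KL(p, q)."""
    p = q * np.exp(-eta * g)     # multiplicative-weights form of the KL prox step
    return p / p.sum()

q = np.array([0.25, 0.25, 0.5])  # current point in the simplex
g = np.array([1.0, 0.0, 2.0])    # "gradient" vector
p = kl_md_step(q, g, eta=1.0)
```

The returned point stays in the simplex and attains a lower objective value than any other feasible point, since the update is exactly the minimizer of the proximal problem.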



An advantage function should measure how much better a is compared to π, while here A is positive when a is worse than π. We keep calling A the advantage function to better align with the convention in the RL literature.

For simplicity, we present all our results with the uniform initial policy in the main paper. See Theorems 5, 7 and 8 in the Appendix for more general results with any initial policy.

Indeed, the stochastic gradient of L_Q is unbounded, since the estimate Q̂_{s,a}(θ) of Q_{s,a}(θ) is unbounded. This is because each single sampled trajectory has unbounded length. See Appendix D.5 for more explanations.

This result appeared after the conference proceedings and is available at https://arxiv.org/pdf/2208.03247.pdf.

In our analysis, we need to drop the positive term E[V_ρ(θ^{(k)})] − V_ρ(π*) to obtain a lower bound, and thus require π* to be an optimal policy.

We are aware that Agarwal et al. (2021, Corollary 6.10) also use Bach & Moulines (2013, Theorem 1) in an early version (https://arxiv.org/pdf/1908.00261v2.pdf) to obtain ε_stat = O(1/T). With further regularity (31), Agarwal et al. (2021) mention that ε_stat = O(1/T) can also be achieved through Hsu et al. (2012, Theorem 16).

We have already mentioned in the comparison with Agarwal et al. (2021) right after Theorem 1 that, although we have linear convergence rates, the magnitude of our error floor is worse (larger) by a factor of ϑ_ρ C_ρ (ϑ_ρ √C_ν for Theorems 3 and 4), due to the concentrability C_ρ and the distribution mismatch coefficient ϑ_ρ used in our proof; this difference comes from the different nature of the proof techniques.

A convex function f is proper if dom f is nonempty and f(x) > −∞ for all x ∈ dom f. A convex function is closed if it is lower semi-continuous.



establish the O(1/√k) convergence rate for two-layer neural NAC with a projection step. Sublinear convergence results are also established by Zanette et al. (2021) and Hu et al. (2022) for the linear MDP (Jin et al., 2020). Agarwal et al. (2021) obtain the same O(1/√k) convergence rate for both projected NPG with smooth policies and projected Q-NPG with log-linear policies. This was later improved to O(1/k) by Liu et al. (2020), who replace the projection step with a strong regularity condition on the Fisher information matrix, and by Xu et al. (2020) for NAC under Markovian sampling. The same O(1/k) convergence rate is established for log-linear policies by Chen et al. (2022) for off-policy NAC. With entropy regularization and a projection step, Cayci et al. (2021) obtain linear convergence for log-linear policies. The same entropy regularization and projection step are applied by Cayci et al. (2022) to neural NAC, improving the O(1/√k) convergence rate of Wang et al. (2020) to O(1/k).

NPG-SGD (Algorithm 5)
Input: number of iterations T, step size α > 0, initialization w_0 ∈ R^m, initial state-action measure ν, policy π(θ), discount factor γ ∈ [0, 1)
for t = 0 to T − 1 do
    Call Algorithm 4 with the inputs (ν, π(θ), γ) to sample (s, a) ∼ d̃_θ and Â_{s,a}(θ)
    Update w_{t+1} = w_t − α ∇̂_w L_A(w_t, θ, d̃_θ) by using (44)
end for
Output: w_out = (1/T) Σ_{t=1}^{T} w_t (the averaged iterate)
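Under illustrative assumptions, NPG-SGD reduces to averaged SGD on a least-squares objective. The sketch below replaces the MDP sampler (Algorithm 4) with a synthetic oracle returning a centered feature φ̄ and a noisy "advantage" that is linear in φ̄; the oracle and all constants are assumptions for illustration, not part of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
m, T, alpha = 4, 2000, 0.05
w_true = rng.normal(size=m)                  # hypothetical best regression weights

def sample_oracle():
    """Stand-in for Algorithm 4: centered feature and unbiased advantage estimate."""
    phi_bar = rng.normal(size=m)
    A_hat = phi_bar @ w_true + 0.1 * rng.normal()
    return phi_bar, A_hat

w = np.zeros(m)
w_sum = np.zeros(m)
for t in range(T):
    phi_bar, A_hat = sample_oracle()
    grad = 2.0 * (w @ phi_bar - A_hat) * phi_bar   # stochastic gradient of L_A, cf. (44)
    w -= alpha * grad
    w_sum += w
w_out = w_sum / T                            # averaged iterate returned by the solver
```

The averaged iterate w_out recovers the underlying regression weights up to the noise level, which is the behavior the sample complexity analysis quantifies.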

) in Lemma 8, we will upper bound | 1 | and | 3 | by the statistical error assumption (19) and upper bound | 2 | and | 4 | by using the transfer error assumption (22). Indeed, to upper bound | 1 |, by the Cauchy–Schwarz inequality, we have

upper bound | 1 |, | 3 |, | a | and | c | by the statistical error assumption (19) as in the proof of Theorem 1. However, we will upper bound | 2 |, | 4 |, | b | and | d | by using the approximation error assumption (27) instead of the transfer error assumption (22).

We will upper bound the absolute values of the above two terms | 1 | and | 2 | separately. More precisely, similar to the proof of Theorem 3, we will upper bound the first term | 1 | by the statistical error assumption (33) and upper bound the second term | 2 | by using the approximation error assumption (34).

the second term | 2 | in (83), by the Cauchy–Schwarz inequality, we have

Now we will upper bound the absolute values of the above two terms | a | and | b | separately.

Now we compare our Corollary 3 with Corollary 33 in Agarwal et al. (2021), which is their corresponding sample complexity result for NPG. The main differences between our Corollary 3 and their Corollary 33 are similar to those for Q-NPG summarized right after Corollary 1: their sample complexity is O(1/((1−γ)¹¹ ε⁶)) while ours is Õ(1/ε²).

Thanks to this argument, we fix a flaw in the previous proof of Liu et al. (2020, Proposition G.1).⁷

Proof. Similar to the proof of Corollary 1, we suppress the subscript k. First, the centered feature map is bounded by ‖φ̄_{s,a}(θ)‖ ≤ 2B. In order to apply Theorem 12, it remains to upper bound

⁷ In a previous version of the proof in their Section G, Liu et al. (2020, Proposition G.1) use the inequality

E[(Â_{s,a}(θ) − w⊤φ̄_{s,a}(θ))² φ̄_{s,a}(θ)(φ̄_{s,a}(θ))⊤] ⪯ E[(Â_{s,a}(θ) − w⊤φ̄_{s,a}(θ))²] · E[φ̄_{s,a}(θ)(φ̄_{s,a}(θ))⊤],

which is incorrect since Â_{s,a}(θ) and φ̄_{s,a}(θ) are correlated random variables. To fix it, we use the following conditional expectation argument:

E[(Â_{s,a}(θ) − w⊤φ̄_{s,a}(θ))² φ̄_{s,a}(θ)(φ̄_{s,a}(θ))⊤] = E[ E[(Â_{s,a}(θ) − w⊤φ̄_{s,a}(θ))² | s, a] φ̄_{s,a}(θ)(φ̄_{s,a}(θ))⊤ ],

and bound the term E[(Â_{s,a}(θ) − w⊤φ̄_{s,a}(θ))² | s, a] in (95).
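The incorrect product-of-expectations step discussed above fails precisely because the two factors are correlated. A scalar counterexample with X = Y standard normal: E[X²Y²] = E[X⁴] = 3, while E[X²]E[Y²] = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200_000)
y = x                                        # perfectly correlated with x
lhs = float(np.mean(x**2 * y**2))            # estimates E[X^2 Y^2] = E[X^4] = 3
rhs = float(np.mean(x**2) * np.mean(y**2))   # estimates E[X^2] E[Y^2] = 1
```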

Overview of different convergence results for NPG methods in the function approximation regime. The darker cells contain our new results. The lighter cells contain previously known results for NPG or Q-NPG with log-linear policies that admit a direct comparison to our new results. White cells contain existing results whose setting differs from ours, so that a direct comparison is not possible.

We refer to Liu et al. (2020, Section B.2) and Ding et al. (2022, Section 8) for more discussions on the Fisher-non-degenerate setting. To prove Corollary 3, our approach is inspired by the proof of the sample complexity analysis of Liu et al. (2020, Theorem 4.9). That is, we require the Fisher-non-degeneracy (94) and apply Theorem 12 to the minimization of the function L_A(w, θ, d̃_θ) without relying on the boundedness of the stochastic gradient. A proof sketch is provided in Appendix G.1. Compared to our result, they obtain a sample complexity with a worse dependence on 1/(1−γ); they consider a projection step for the iterates and incorrectly bound the stochastic gradient due to a similar error indicated in Footnote 3 (see Appendix G.1 for more details), while we assume the Fisher-non-degeneracy (94).

ACKNOWLEDGMENTS

We gratefully acknowledge Daniel Russo, who pointed out that we did not properly cite Bhandari & Russo (2021) in the literature review in a previous version. We acknowledge the helpful discussion with Yanli Liu on the sample complexity analysis of both Q-NPG and NPG. We would also like to thank the anonymous reviewers for their helpful comments.

Appendix

E‖Â_{s,a}(θ) φ̄_{s,a}(θ)‖² and ‖w*‖ with w* ∈ argmin_w L_A(w, θ, d̃_θ), and find σ > 0 such that

E[Â_{s,a}(θ)² − 2 Â_{s,a}(θ) (w*)⊤φ̄_{s,a}(θ) + ((w*)⊤φ̄_{s,a}(θ))² | s, a] ≤ σ²   (96)

holds for all (s, a) ∈ S × A and θ ∈ R^m. Similar to the proof of Corollary 1, the closed-form solution w* can be written as

w* = (E_{(s,a)∼d̃_θ}[φ̄_{s,a}(θ) φ̄_{s,a}(θ)⊤])† E_{(s,a)∼d̃_θ}[Q_{s,a}(θ) φ̄_{s,a}(θ)].

From (94), we obtain an upper bound on ‖w*‖. Now we need to upper bound E[Â_{s,a}(θ)² | s, a] from (95). In the corresponding derivation, the last line is obtained because E[V̂_{s,a}(θ)² | s, a] shares the same upper bound (74) by a similar argument. From (97) and ‖φ̄_{s,a}(θ)‖ ≤ 2B, we verify that E‖Â_{s,a}(θ) φ̄_{s,a}(θ)‖² is bounded as well. By using the upper bounds on E[Â_{s,a}(θ)² | s, a] and ‖w*‖, together with |A_{s,a}(θ)| ≤ 2/(1−γ) and ‖φ̄_{s,a}(θ)‖ ≤ 2B, the left-hand side of (95) is upper bounded, and we choose σ² accordingly. Now all the conditions (i)–(vi) of Theorem 12 are verified. The remainder of the proof follows that of Corollary 1.

Lemma 10 (Mirror descent on the simplex, Example 9.10 in Beck (2017)). Let g ∈ R^n, which will often be a gradient, and let η > 0. For p, q in the unit n-simplex ∆_n, the mirror descent step with respect to the KL divergence,

min_{p ∈ ∆_n} η⟨g, p⟩ + D(p, q),   (98)

is given by

p = (q ⊙ e^{−ηg}) / (Σ_{i=1}^n q_i e^{−ηg_i}),   (99)

where ⊙ is the element-wise product between vectors.

Proof. The Lagrangian of (98) is given by

L(p, μ, λ) = η⟨g, p⟩ + D(p, q) + μ(1 − 1_n⊤ p) − ⟨λ, p⟩,

where μ ∈ R and λ ∈ R^n with non-negative coordinates are the Lagrange multipliers. Thus the Karush-Kuhn-Tucker conditions are given by

ηg + log(p/q) + 1_n = μ1_n + λ,    1_n⊤ p = 1,

where the division p/q is element-wise. Isolating p in the first equation gives

p = q ⊙ e^{(μ−1)1_n + λ − ηg} = e^{μ−1} q ⊙ e^{λ−ηg}.

Using the second condition 1_n⊤ p = 1 gives

1 = e^{μ−1} Σ_{i=1}^n q_i e^{λ_i−ηg_i}  ⟹  e^{μ−1} = 1 / Σ_{i=1}^n q_i e^{λ_i−ηg_i}.

Consequently, by plugging this term into p, we have

p = (q ⊙ e^{λ−ηg}) / (Σ_{i=1}^n q_i e^{λ_i−ηg_i}).

It remains to determine λ. If q_i = 0 then p_i = 0, and the value of λ_i does not affect p. Conversely, if q_i > 0 then p_i > 0, and thus λ_i = 0 by complementary slackness. In either case, the solution is given by (99).

Now we present the three-point descent lemma on proximal optimization with Bregman divergences, which is another key ingredient of our PMD analysis. Following Xiao (2022, Lemma 6), we adopt a slight variation of Lemma 3.2 in Chen & Teboulle (1993). First, we need some technical conditions.

Definition 10 (Legendre function, Section 26 in Rockafellar (1970)).
We say a function h is of Legendre type, or a Legendre function, if the following properties are satisfied: (i) h is strictly convex in the relative interior of dom h, denoted rint dom h; (ii) h is essentially smooth, i.e., h is differentiable in rint dom h and, for any boundary point x_b of rint dom h, lim_{x→x_b} ‖∇h(x)‖ = +∞.

Definition 11 (Bregman divergence (Bregman, 1967; Censor & Zenios, 1997)). Let h: dom h → R be a Legendre function and assume that rint dom h is nonempty. The Bregman divergence D_h(·, ·): dom h × rint dom h → [0, ∞) generated by h is the distance-like function defined as

D_h(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩.   (100)

Under the above conditions, we have the following result. We also provide its proof for self-containment, as Xiao (2022) does not provide a formal proof.

Lemma 11 (Three-point descent lemma, Lemma 6 in Xiao (2022)). Suppose that C ⊆ R^n is a closed convex set, f: C → R is a proper, closed convex function, D_h(·, ·) is the Bregman divergence generated by a function h of Legendre type, and rint dom h ∩ C ≠ ∅. For any x ∈ rint dom h, let

x⁺ = argmin_{u ∈ C} { f(u) + D_h(u, x) }.

Then x⁺ ∈ rint dom h ∩ C and, for any u ∈ dom h ∩ C,

f(x⁺) + D_h(x⁺, x) + D_h(u, x⁺) ≤ f(u) + D_h(u, x).

Proof. First, we prove that for any a, b ∈ rint dom h and c ∈ dom h, the following identity holds:

⟨∇h(a) − ∇h(b), c − a⟩ = D_h(c, b) − D_h(c, a) − D_h(a, b).   (101)

Indeed, using the definition of D_h in (100), we have

D_h(c, a) = h(c) − h(a) − ⟨∇h(a), c − a⟩,   (102)
D_h(a, b) = h(a) − h(b) − ⟨∇h(b), a − b⟩,   (103)
D_h(c, b) = h(c) − h(b) − ⟨∇h(b), c − b⟩.   (104)

Subtracting (102) and (103) from (104) yields (101).

Next, since h is of Legendre type, we have x⁺ ∈ rint dom h ∩ C. Otherwise, x⁺ would be a boundary point of dom h; from the definition of a Legendre function, ‖∇h(x⁺)‖ = ∞, which is not possible, as x⁺ is also the minimizer of f(u) + D_h(u, x). By the first-order optimality condition, we have

⟨g⁺ + ∇h(x⁺) − ∇h(x), u − x⁺⟩ ≥ 0 for all u ∈ dom h ∩ C,

where g⁺ ∈ ∂f(x⁺) is a subgradient of f at x⁺. Besides, plugging c = u, a = x⁺ and b = x into (101), we obtain

D_h(u, x) − D_h(u, x⁺) − D_h(x⁺, x) = ⟨∇h(x⁺) − ∇h(x), u − x⁺⟩ ≥ ⟨x⁺ − u, g⁺⟩.

Rearranging terms and adding f(u) to both sides, we have

f(u) + D_h(u, x) ≥ f(u) + ⟨x⁺ − u, g⁺⟩ + D_h(x⁺, x) + D_h(u, x⁺) ≥ f(x⁺) + D_h(x⁺, x) + D_h(u, x⁺),

which concludes the proof. The last inequality is obtained from the convexity of f and g⁺ ∈ ∂f(x⁺).

Finally, we use the following linear regression analysis for the proof of our sample complexity results, i.e., Corollaries 1 and 3.
Published as a conference paper at ICLR 2023.

Theorem 12 (Theorem 1 in Bach & Moulines (2013)). Consider the following assumptions:

(i) H is an m-dimensional Euclidean space.
(ii) The observations (x_n, z_n) ∈ H × H are independent and identically distributed.
(iii) E‖x_n‖² and E‖z_n‖² are finite, and the covariance matrix E[x_n x_n⊤] is invertible.
(iv) The global minimum of f(θ) = (1/2) E[⟨θ, x_n⟩² − 2⟨θ, z_n⟩] is attained at a certain θ* ∈ H. Let ξ_n = z_n − ⟨θ*, x_n⟩ x_n denote the residual; we have E[ξ_n] = 0.
(v) Consider the stochastic gradient recursion θ_n = θ_{n−1} − η(⟨θ_{n−1}, x_n⟩ x_n − z_n), started from θ_0 ∈ H, and also consider the averaged iterates θ_out = (1/(n+1)) Σ_{k=0}^n θ_k.
(vi) There exist R > 0 and σ > 0 such that E[ξ_n ξ_n⊤] ⪯ σ² E[x_n x_n⊤] and E[‖x_n‖² x_n x_n⊤] ⪯ R² E[x_n x_n⊤].

When η = 1/(4R²), we have

E[f(θ_out)] − f(θ*) ≤ (2/n) (σ√m + R‖θ_0 − θ*‖)².   (106)
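A minimal numeric sketch of the Theorem 12 recursion on a synthetic least-squares problem. The dimension, sample size and noise level are illustrative assumptions, and the choice R² = m + 2 uses the identity E[‖x‖² x x⊤] = (m + 2) I for standard Gaussian x.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 50_000
theta_star = rng.normal(size=m)  # hypothetical minimizer of f
R2 = m + 2.0                     # E[||x||^2 x x^T] = (m + 2) I for x ~ N(0, I_m)
eta = 1.0 / (4.0 * R2)           # step size prescribed by Theorem 12

X = rng.normal(size=(n, m))
z = X @ theta_star + 0.1 * rng.normal(size=n)

theta = np.zeros(m)
acc = np.zeros(m)
for k in range(n):
    x = X[k]
    theta = theta - eta * ((theta @ x) - z[k]) * x   # LMS recursion of condition (v)
    acc += theta
theta_bar = acc / n              # averaged iterate theta_out
```

The averaged iterate lands close to θ*, in line with the O(1/n) excess-risk guarantee in (106).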

