LINEAR CONVERGENCE FOR NATURAL POLICY GRADIENT WITH LOG-LINEAR POLICY PARAMETRIZATION

Abstract

We analyze the convergence rate of the unregularized natural policy gradient algorithm with log-linear policy parametrizations in infinite-horizon discounted Markov decision processes. In the deterministic case, when the Q-value is known and can be approximated by a linear combination of a known feature function up to a bias error, we show that a geometrically-increasing step size yields a linear convergence rate towards an optimal policy. We then consider the sample-based case, when the best representation of the Q-value function among linear combinations of a known feature function is known up to an estimation error. In this setting, we show that the algorithm enjoys the same linear guarantees as in the deterministic case up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix. Our results build upon the general framework of policy mirror descent and extend previous findings for the softmax tabular parametrization to the log-linear policy class.

1. INTRODUCTION

Sequential decision-making represents a framework of paramount importance in modern statistics and machine learning. In this framework, an agent sequentially interacts with an environment to maximize notions of reward. In these interactions, an agent observes its current state s ∈ S, takes an action a ∈ A according to a policy that associates to each state a probability distribution over actions, receives a reward, and transitions to a new state. Reinforcement Learning (RL) focuses on the case where the agent does not have complete knowledge of the environment dynamics. One of the most widely used classes of algorithms for RL is represented by policy optimization. In policy optimization algorithms, an agent iteratively updates a policy that belongs to a given parametrized class with the aim of maximizing the expected sum of discounted rewards, where the expectation is taken over the trajectories induced by the policy. Many types of policy optimization techniques have been explored in the literature, such as policy gradient methods (Sutton et al., 1999), natural policy gradient methods (Kakade, 2002), trust region policy optimization (Schulman et al.), and proximal policy optimization (Schulman et al., 2017). Thanks to the versatility of the policy parametrization framework, in particular the possibility of incorporating flexible approximation schemes such as neural networks, these methods have been successfully applied in many settings. However, a complete theoretical justification for the success of these methods is still lacking. The simplest and best understood setting for policy optimization is the tabular case, where both the state space S and the action space A are finite and the policy has a direct parametrization, i.e. it assigns a probability to each state-action pair. This setting has received a lot of attention in recent years and has seen several developments (Agarwal et al., 2021; Xiao, 2022).
Its analysis is particularly convenient due to the decoupled nature of the parametrization, where the probability distribution over the action space that the policy assigns to each state can be updated and analyzed separately for each state. This leads to a simplified analysis, where it is often possible to drop discounted visitation distribution terms in the policy update and take advantage of the contractivity property typical of value and policy iteration methods. Recent results involve, in particular, natural policy gradient (NPG) and, more generally, policy mirror descent, showing how specific choices of learning rates yield linear convergence to the optimal policy for several formulations and variations of these algorithms (Cen et al., 2021; Zhan et al., 2021; Khodadadian et al.; Xiao, 2022; Li et al., 2022; Lan, 2022; Bhandari and Russo, 2021; Mei et al., 2020). Two of the main shortfalls of these methods are their computational and sample complexities, which depend polynomially on the cardinality of the state and action spaces, even in the case of linear convergence. Indeed, by design, these algorithms need to update at each iteration a parameter or a probability for all state-action pairs, which has an operation cost proportional to |S||A|. Furthermore, in order to preserve linear convergence in the sample-based case, the aforementioned works assume that the worst-case estimate ($\ell_\infty$ norm) of $Q^\pi(s,a)$, which is the expected sum of discounted rewards starting from the state-action pair $(s,a)$ and following a policy $\pi$, is exact up to a given error threshold. Without further assumptions, meeting this threshold requires a number of samples that depends polynomially on |S||A|. A promising approach to deal with large and high-dimensional spaces, which has recently been explored, is to assume that the environment has a low-rank structure and that, as a consequence, it can be described or approximated by a lower-dimensional representation.
In particular, a popular framework is that of linear function approximation, which consists in assuming that quantities of interest in the problem formulation, such as the transition probability (Linear MDPs) or the action-value function $Q^\pi$ of a policy $\pi$, can be approximated by a linear combination of a certain $d$-dimensional feature function $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ up to a bias error $\varepsilon_{\mathrm{bias}}$. This linear assumption reduces the dimensionality of the problem to that of the feature function. In this setting, many researchers have proposed methods to learn the best representation $\phi$ (Agarwal et al., 2020; Modi et al., 2021; Uehara et al., 2021; Zhang et al., 2022) and to exploit it to design efficient variations of the upper confidence bound (UCB) algorithm, for instance (Jin et al., 2020; Li et al., 2021; Wagenmaker et al., 2022). When applying the framework of linear approximation to policy optimization, researchers typically adopt the log-linear policy class, where a policy $\pi_\theta$ parametrized by $\theta \in \mathbb{R}^d$ is defined as proportional to $\exp(\theta^\top \phi)$. For this policy class, several works have obtained improvements in terms of computational and sample complexity, as the policy update requires a number of operations that scales only with the feature dimension $d$, and the estimation assumption needed to retain convergence rates in the sample-based setting is weaker than its tabular counterpart. In fact, theoretical guarantees for these algorithms only assume the expectation of $Q^\pi$ over a known distribution on the state and action spaces to be exact up to a statistical error $\varepsilon_{\mathrm{stat}}$. In the linear function approximation setting, meeting this assumption typically requires a number of samples that is only a function of $d$ and does not depend on $|\mathcal{S}|$ and $|\mathcal{A}|$ (Telgarsky, 2022). However, a complete understanding of the convergence rate of policy optimization methods in this setting is still missing.
Recent results include sublinear convergence rates for unregularized NPG (Agarwal et al., 2021; Qiu et al., 2021; Zanette et al., 2021; Hu et al., 2021) and linear convergence rates for entropy-regularized NPG with bounded updates (Cayci et al., 2021). Our work fills the gap between the aforementioned findings and extends the analysis and results of the tabular setting to the linear function approximation setting. In particular, we show that, under the standard assumptions on the $(\varepsilon_{\mathrm{stat}}, \varepsilon_{\mathrm{bias}})$-approximation of $Q^\pi$ mentioned above, a choice of geometrically-increasing step-sizes leads to linear convergence of NPG for the log-linear policy class in both deterministic and sample-based settings. Our result directly improves upon the sublinear iteration complexity of NPG previously established for the log-linear policy class by Agarwal et al. (2021) and Hu et al. (2021), and it removes the need for entropy regularization and bounded step-sizes used by Cayci et al. (2021), under the same assumptions on the linear approximation of $Q^\pi$. Moreover, the number of operations needed for the policy update and the number of samples needed to preserve the convergence rate in the sample-based setting depend on the dimension $d$ of $\phi$, as opposed to the tabular setting where the same quantities depend on $|\mathcal{S}||\mathcal{A}|$. By extending the linear convergence rate of NPG from the tabular softmax parametrization to the setting of log-linear policy parametrizations, our result directly addresses the research direction outlined in the conclusion of Xiao (2022), and it overcomes the aforementioned limitations of the tabular setting. Our analysis is based on the equivalence of NPG and policy mirror descent with KL divergence (Raskutti and Mukherjee, 2015), which has been exploited for applying mirror-descent-type analysis to NPG by several works, such as Agarwal et al. (2021); Hu et al. (2021); Cayci et al. (2021). The advantages of this equivalence are twofold.
Firstly, NPG crucially ensures a simple update rule, i.e. $\log \pi^{t+1}(a|s) = \log \pi^{t}(a|s) + \eta_t Q^{\pi^t}(s,a)$, which in the particular case of the log-linear policy class translates into $\theta_{t+1}^\top \phi(s,a) = \theta_t^\top \phi(s,a) + \eta_t Q^{\pi^t}(s,a)$. Secondly, the mirror descent setup is particularly useful to iteratively control the updates and the approximation errors, e.g. through tools like the three-point descent lemma (see (3) below), and to induce telescopic sums or recursions that are often used to analyze the convergence rate of the last iterate. In our work, we show how to exploit these advantages to use the linear approximation of $Q^\pi$ in the analysis and, consequently, make weaker assumptions on the accuracy of the estimation of $Q^\pi$ than in the tabular setting. While previous results for the tabular setting (Cen et al., 2021; Zhan et al., 2021; Xiao, 2022) require an $\ell_\infty$ norm bound on the estimation error, i.e. $\|\widehat{Q}^\pi - Q^\pi\|_\infty \le \tau$, our convergence guarantee depends on the expected error of the estimate, i.e. $\mathbb{E}\big[(\widehat{Q}^\pi(s,a) - Q^\pi(s,a))^2\big] \le \varepsilon_{\mathrm{stat}}$, where the expectation is taken w.r.t. the discounted state visitation distribution induced by the policy $\pi$ and the uniform distribution over the action space. This allows us to employ sample-efficient policy evaluation algorithms, such as temporal difference learning (Hu et al., 2021; Telgarsky, 2022), and to remove the cardinality of the state and action spaces $|\mathcal{S}||\mathcal{A}|$ from the sample complexity of the algorithm. The paper is organized as follows. Section 2 introduces the main setting of RL, and Section 3 introduces the algorithmic framework we consider. Section 4 contains the linear approximation set-up and our main result. Section 5 presents the analysis of our main result, with the conclusions outlined in Section 6.

2. SETTING

Consider an agent that acts in a discounted Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \mu)$, where: $\mathcal{S}$ is the possibly infinite state space and $\mathcal{A}$ is the finite action space; $P(s'|s,a)$ is the transition probability; $r(s,a) \in [0,1]$ is the reward function; $\gamma$ is the discount factor; and $\mu$ is the starting state distribution. A policy $\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ assigns to each state $s$ a probability distribution over $\mathcal{A}$, where $\pi(a|s)$ is the probability that the agent takes action $a$ when in state $s$. At time $t$ denote the current state and action by $s_t$ and $a_t$.

For a policy $\pi$, let $V^\pi : \mathcal{S} \to \mathbb{R}$ be the respective value function, which is defined as the expected discounted cumulative reward with starting state $s_0 = s$, namely,

$$V^\pi_s = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \pi, s_0 = s\Big],$$

where $a_t \sim \pi(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. Let $V^\pi(\mu) = \mathbb{E}_{s\sim\mu} V^\pi_s$. The agent aims to find an optimal policy $\pi^\star \in \operatorname{argmax}_\pi V^\pi(\mu)$.

For a policy $\pi$, let $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ be the respective action-value function, or Q-function, which is defined as the expected discounted cumulative reward with starting state $s_0 = s$ and starting action $a_0 = a$, namely,

$$Q^\pi(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \pi, s_0 = s, a_0 = a\Big],$$

where $a_t \sim \pi(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$.

Define the discounted state visitation distribution (Sutton et al., 1999)

$$d^\pi_\mu(s) = (1-\gamma)\, \mathbb{E}_{s_0\sim\mu} \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi, s_0),$$

and the discounted state-action visitation distribution

$$d^\pi_\rho(s,a) = (1-\gamma)\, \mathbb{E}_{s_0,a_0\sim\rho} \sum_{t=0}^{\infty} \gamma^t P(s_t = s, a_t = a \mid \pi, s_0, a_0),$$

where the trajectory $(s_t, a_t)_{t\ge 0}$ is generated by the MDP following policy $\pi$ and $\rho$ is a distribution over $\mathcal{S} \times \mathcal{A}$. Then we can formulate the performance difference lemma (Kakade and Langford, 2002), a tool that will prove useful in our analysis: for any two policies $\pi$ and $\pi'$,

$$V^{\pi}(\mu) - V^{\pi'}(\mu) = \frac{1}{1-\gamma}\, \mathbb{E}_{s\sim d^{\pi}_\mu} \sum_{a\in\mathcal{A}} Q^{\pi'}(s,a)\big(\pi(a|s) - \pi'(a|s)\big). \tag{1}$$
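The performance difference lemma (1) can be checked numerically on a small finite MDP. The following is a minimal sketch, assuming a randomly generated MDP with 3 states and 2 actions; the helper names (`policy_evaluation`, `state_visitation`, `random_simplex`) and all numerical values are hypothetical and only serve the illustration.

```python
import random

def policy_evaluation(P, r, pi, gamma, iters=2000):
    """Approximate V^pi and Q^pi by iterating the Bellman equation."""
    n_s, n_a = len(r), len(r[0])
    V = [0.0] * n_s
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
    return V, Q

def state_visitation(P, pi, mu, gamma, iters=2000):
    """d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s), truncated at `iters`."""
    n_s, n_a = len(pi), len(pi[0])
    p_t = list(mu)            # distribution of s_t
    d = [0.0] * n_s
    weight = 1.0 - gamma
    for _ in range(iters):
        d = [d[s] + weight * p_t[s] for s in range(n_s)]
        p_t = [sum(p_t[s] * pi[s][a] * P[s][a][s2]
                   for s in range(n_s) for a in range(n_a))
               for s2 in range(n_s)]
        weight *= gamma
    return d

def random_simplex(rng, n):
    """Draw a random probability vector of length n."""
    x = [rng.random() + 1e-3 for _ in range(n)]
    z = sum(x)
    return [v / z for v in x]

rng = random.Random(0)
n_s, n_a, gamma = 3, 2, 0.9
P = [[random_simplex(rng, n_s) for _ in range(n_a)] for _ in range(n_s)]
r = [[rng.random() for _ in range(n_a)] for _ in range(n_s)]
mu = random_simplex(rng, n_s)
pi = [random_simplex(rng, n_a) for _ in range(n_s)]
pi_prime = [random_simplex(rng, n_a) for _ in range(n_s)]

V_pi, _ = policy_evaluation(P, r, pi, gamma)
V_pr, Q_pr = policy_evaluation(P, r, pi_prime, gamma)
d_pi = state_visitation(P, pi, mu, gamma)

# Left-hand side of (1): V^pi(mu) - V^pi'(mu).
lhs = sum(mu[s] * (V_pi[s] - V_pr[s]) for s in range(n_s))
# Right-hand side of (1): expectation under d^pi_mu of <Q^pi', pi - pi'>.
rhs = sum(d_pi[s] * sum(Q_pr[s][a] * (pi[s][a] - pi_prime[s][a])
                        for a in range(n_a))
          for s in range(n_s)) / (1.0 - gamma)
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point and truncation error, since both infinite sums are cut off after 2000 terms.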

2.1. NOTATION

We make the following definitions for ease of exposition. As to the policy, let $\pi_s := \pi(\cdot|s)$ and $\pi^t := \pi_{\theta_t}$. For two functions $f$ and $g$, denote $(f \bullet g)(x,y) = f(x)g(y)$. As to the action-value function, let $Q^\pi_s := Q^\pi(s,\cdot)$ and $Q^t(s,a) := Q^{\pi^t}(s,a)$. As to the discounted visitation distributions, let $d^t_\mu := d^{\pi^t}_\mu$, $d^t := d^t_\mu \bullet \mathrm{Unif}_{\mathcal{A}}$, and $d^\star := d^\star_\mu \bullet \mathrm{Unif}_{\mathcal{A}}$. Lastly, denote $\mathrm{KL}^\star_t = \mathbb{E}_{s\sim d^\star_\mu} \mathrm{KL}(\pi^\star_s, \pi^t_s)$.

3. NATURAL POLICY GRADIENT AND MIRROR DESCENT

Policy class. In this work, we consider the log-linear policy parametrization (Agarwal et al., 2021). Let $\theta \in \mathbb{R}^d$ be a parameter vector and $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ be a feature function. Then the policy class consists of all policies of the form:

$$\pi_\theta(a|s) = \frac{\exp(\theta^\top \phi(s,a))}{\sum_{a'\in\mathcal{A}} \exp(\theta^\top \phi(s,a'))}.$$

Natural Policy Gradient. We formulate NPG through mirror descent. The update at time $t+1$ is

$$\nabla h(\pi^{t+1}_s) = \nabla h(\pi^t_s) + \eta_t Q^t_s \quad \forall s \in \mathcal{S}, \tag{2}$$

where $h(\pi_s) = \sum_{a\in\mathcal{A}} \pi(a|s) \log \pi(a|s)$ is the negative-entropy mirror map. This is equivalent to the update

$$\pi^{t+1}(s,a) \propto \pi^t(s,a)\, e^{\eta_t Q^t(s,a)} \quad \forall s,a \in \mathcal{S},\mathcal{A},$$

or, as in Algorithm 1, to requiring that $\theta_{t+1}$ is such that

$$\theta_{t+1}^\top \phi(s,a) = \theta_t^\top \phi(s,a) + \eta_t Q^t(s,a) \quad \forall s,a \in \mathcal{S},\mathcal{A}.$$

In the tabular setting, we have $d = |\mathcal{S}||\mathcal{A}|$, $\phi(s,a)$ is a vector of all zeros except a one in the position assigned to $(s,a)$, and the update is equivalent to the one analyzed by Agarwal et al. (2021).

This mirror descent setup allows us to use standard mirror descent tools in the analysis (Bubeck, 2015; Hu et al., 2021; Xiao, 2022), such as the three-point descent lemma

$$D_h(\pi_s, \pi^t_s) - D_h(\pi_s, \pi^{t+1}_s) - D_h(\pi^{t+1}_s, \pi^t_s) = \langle \nabla h(\pi^t_s) - \nabla h(\pi^{t+1}_s),\, \pi^{t+1}_s - \pi_s \rangle \quad \forall \pi_s, \tag{3}$$

which in this setting can be expressed as

$$\mathrm{KL}(\pi_s, \pi^t_s) - \mathrm{KL}(\pi_s, \pi^{t+1}_s) - \mathrm{KL}(\pi^{t+1}_s, \pi^t_s) = -\eta_t \langle Q^t_s,\, \pi^{t+1}_s - \pi_s \rangle \quad \forall \pi_s. \tag{4}$$

These tools, along with the performance difference lemma (1), ensure that we can control the increase of the value function for each policy update. When we only have access to an approximation $\widehat{Q}^\pi$ of $Q^\pi$, these tools allow controlling the error of this approximation by means of simple triangle inequality arguments, making it possible to incorporate the linear function approximation framework where $Q^\pi$ is approximated by a linear combination of the feature function $\phi$.
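To make the equivalence between the parameter update and the multiplicative policy update concrete, the following minimal sketch evaluates a log-linear policy at a single state and checks that adding $\eta w$ to $\theta$ yields a policy proportional to the old one times $\exp(\eta\, w^\top \phi(s,a))$. The feature vectors, the parameter values and the function names (`log_linear_policy`, `npg_step`) are hypothetical and chosen only for illustration.

```python
import math

def log_linear_policy(theta, features_s):
    """pi_theta(.|s) for one state; features_s[a] is the vector phi(s, a)."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features_s]
    m = max(logits)                      # subtract the max for stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [x / z for x in weights]

def npg_step(theta, w, eta):
    """Parameter update theta_{t+1} = theta_t + eta_t * w_t."""
    return [t + eta * wi for t, wi in zip(theta, w)]

# Hypothetical state with 3 actions and feature dimension d = 2.
features_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
theta = [0.2, -0.1]
w = [1.0, 0.3]        # stands in for the linear Q-estimate weights
eta = 0.5

pi_old = log_linear_policy(theta, features_s)
pi_new = log_linear_policy(npg_step(theta, w, eta), features_s)

# Multiplicative form: pi_new(a) proportional to pi_old(a) * exp(eta * w^T phi(s, a)).
q_hat = [sum(wi * f for wi, f in zip(w, phi)) for phi in features_s]
unnormalized = [p * math.exp(eta * q) for p, q in zip(pi_old, q_hat)]
z = sum(unnormalized)
pi_mirror = [u / z for u in unnormalized]

print(pi_new)
print(pi_mirror)
```

Both printed vectors should coincide up to rounding, which is the reason the update only needs to touch the $d$ entries of $\theta$ rather than one probability per state-action pair.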

4. MAIN RESULT

In this section, we present our main result on the linear convergence of NPG. We start by introducing and discussing the assumptions and the algorithm.

4.1. ALGORITHM AND LINEAR FUNCTION APPROXIMATION

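We make the following two assumptions on the linear approximation of $Q^\pi$, which are standard in the literature (Agarwal et al., 2021; Cayci et al., 2021).

Assumption 4.1. (Bias error) Define the loss function

$$L(w, \theta, v) := \mathbb{E}_{s,a\sim v}\big[(Q^{\pi_\theta}(s,a) - w^\top \phi(s,a))^2\big] \quad \text{and let} \quad w^\star_t \in \operatorname{argmin}_{w} L(w, \theta_t, d^t_\rho). \tag{5}$$

Assume that $\forall t < T$ we have

$$L(w^\star_t, \theta_t, d^\star) \le \varepsilon_{\mathrm{bias}}, \qquad L(w^\star_t, \theta_t, d^{t+1}_\mu \bullet \mathrm{Unif}_{\mathcal{A}}) \le \varepsilon_{\mathrm{bias}}.$$

In order to better understand the implications of Assumption 4.1, we consider the trivial upper bound (Agarwal et al., 2021)

$$L(w^\star_t, \theta_t, d^\star) \le \left\| \frac{d^\star}{d^t_\rho} \right\|_\infty L(w^\star_t, \theta_t, d^t_\rho) \le \frac{1}{1-\gamma} \left\| \frac{d^\star}{\rho} \right\|_\infty L(w^\star_t, \theta_t, d^t_\rho).$$

This bound allows us to think of $\varepsilon_{\mathrm{bias}}$ in Assumption 4.1 as controlling two quantities of interest. The first quantity is the loss incurred by the minimizer of $L(w, \theta_t, d^t_\rho)$, that is the best approximation $\widetilde{Q}^t = (w^\star_t)^\top \phi$ of $Q^t$ with respect to the squared error averaged over the distribution $d^t_\rho$. The second quantity is the shift in distribution from $d^t_\rho$ to $d^\star$ in the loss function. A similar conclusion for Assumption 4.1 can be drawn for the distribution $d^{t+1}_\mu \bullet \mathrm{Unif}_{\mathcal{A}}$.

Assumption 4.2 concerns the statistical error incurred when solving the minimization problem in (5) and it can be used to describe the sample complexity of the algorithm. Let $\widehat{Q}^\pi(s,a) = w_t^\top \phi(s,a)$ be the sample-based estimate of $Q^\pi$. Then Assumption 4.2 is equivalent to assuming that

$$\mathbb{E}_{s,a\sim d^t_\rho}\big[(\widehat{Q}^\pi(s,a) - Q^\pi(s,a))^2\big] \le \varepsilon_{\mathrm{stat}}.$$

Several algorithms have been shown to satisfy Assumption 4.2 with a number of samples that depends only on the dimension $d$ of $\phi$ and not on $|\mathcal{S}|$ or $|\mathcal{A}|$, such as temporal difference learning (Telgarsky, 2022). This represents an improvement over the sample complexity of tabular algorithms, where the typical assumption $\|\widehat{Q}^\pi - Q^\pi\|_\infty \le \varepsilon_{\mathrm{stat}}$ (Xiao, 2022; Li et al., 2022) causes the sample complexity to depend on $|\mathcal{S}||\mathcal{A}|$.

With this set-up, we can formulate NPG with linear function approximation as in Algorithm 1. At time step $t$, let $D(t, \rho)$ be an oracle such that $w_t = D(t, \rho)$ satisfies Assumption 4.2.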

Algorithm 1: NPG with linear function approximation

Input: Learning rate schedule $(\eta_t)_{t\ge 0}$; number of iterations $T$; initialized policy $\pi^{(0)}$; distribution $\rho$; oracle $D$.
for $t = 0, \dots, T-1$ do
    Obtain $w_t = D(t, \rho)$.
    Update $\theta_{t+1} = \theta_t + \eta_t w_t$.
end for

Remark 4.3. (Tabular setting) It is possible to recover the tabular case by setting $d = |\mathcal{S}||\mathcal{A}|$ and $\phi(s,a)$ to be a vector of all zeros except a one in the position assigned to $(s,a)$. In this case, we recover the same update as the tabular setting and we have that $\varepsilon_{\mathrm{bias}} = 0$, as by setting $w = Q^t$ we obtain

$$\mathbb{E}_{s\sim\nu,\, a\sim\mathrm{Unif}_{\mathcal{A}}}\big[(Q^t(s,a) - w^\top \phi(s,a))^2\big] = 0$$

for any distribution $\nu$.

Remark 4.4. (Linear MDPs) Another setting for which the bias error $\varepsilon_{\mathrm{bias}}$ is equal to 0 is that of Linear MDPs (Jin et al., 2020), where it is assumed that the transition probability distribution and the reward function can be expressed as a linear function of the feature function $\phi$. Namely, assume there exist two feature maps $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and $\mu : \mathcal{S} \to \mathbb{R}^d$ and a vector $v_r \in \mathbb{R}^d$ such that

$$P(s'|s,a) = \langle \phi(s,a), \mu(s') \rangle, \qquad r(s,a) = \langle v_r, \phi(s,a) \rangle \quad \forall s, s' \in \mathcal{S},\ a \in \mathcal{A}.$$

If this assumption is satisfied, then we have that $\forall s \in \mathcal{S}, a \in \mathcal{A}$

$$Q^\pi(s,a) = r(s,a) + \gamma \int_{\mathcal{S}} V^\pi(s') P(s'|s,a)\, ds' = \Big\langle \phi(s,a),\ v_r + \gamma \int_{\mathcal{S}} V^\pi(s') \mu(s')\, ds' \Big\rangle,$$

which means that at each time step $t$ there exists a $w_t \in \mathbb{R}^d$ such that $Q^t(s,a) = \langle w_t, \phi(s,a) \rangle$ and $L(w_t, \theta_t, d^t_\mu) = 0$.
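As an illustration of Algorithm 1 in the tabular special case of Remark 4.3, the sketch below runs the update $\theta_{t+1} = \theta_t + \eta_t w_t$ with an exact oracle $w_t = Q^t$ on a hypothetical two-state, two-action MDP. The MDP, the initial step size and the growth factor 1.5 are assumptions made for the example and are not taken from the paper; in particular the growth factor is not derived from $\nu_\mu$.

```python
import math

GAMMA = 0.9
# Hypothetical MDP: actions are 0 = stay, 1 = switch state.
# Only staying in state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],    # transitions from state 0
     [[0.0, 1.0], [1.0, 0.0]]]    # transitions from state 1
R = [[0.0, 0.0], [1.0, 0.0]]
MU = [0.5, 0.5]

def softmax(logits):
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [x / z for x in weights]

def evaluate(pi, iters=2000):
    """Return (V^pi, Q^pi) by iterating the Bellman equation."""
    V = [0.0, 0.0]
    Q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iters):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(2))
              for a in range(2)] for s in range(2)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(2)) for s in range(2)]
    return V, Q

def run_npg(T=30, eta0=1.0, growth=1.5):
    """Tabular NPG: theta has one entry per (s, a), and w_t = Q^t exactly."""
    theta = [[0.0, 0.0], [0.0, 0.0]]
    eta = eta0
    values = []
    for _ in range(T):
        pi = [softmax(theta[s]) for s in range(2)]
        V, Q = evaluate(pi)
        values.append(sum(MU[s] * V[s] for s in range(2)))
        # With one-hot features, theta + eta * w is an entrywise update.
        theta = [[theta[s][a] + eta * Q[s][a] for a in range(2)]
                 for s in range(2)]
        eta *= growth                 # geometrically increasing step size
    pi = [softmax(theta[s]) for s in range(2)]
    V, _ = evaluate(pi)
    values.append(sum(MU[s] * V[s] for s in range(2)))
    return values

# Optimal policy: switch in state 0, stay in state 1, so V*(1) = 10, V*(0) = 9.
V_STAR_MU = 0.5 * 9.0 + 0.5 * 10.0
values = run_npg()
gaps = [V_STAR_MU - v for v in values]
for t in (0, 1, 2, 5, 10, 30):
    print(t, gaps[t])
```

With an exact oracle the statistical and bias errors are both zero, so the printed optimality gap should shrink monotonically towards zero; replacing `evaluate` with a noisy estimator would be the natural way to explore the sample-based case.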

4.2. LINEAR CONVERGENCE

In order to present the main result of our work, we need two additional assumptions on the distribution mismatch coefficient and the feature covariance matrix.

Assumption 4.5. Assume that the distribution mismatch coefficient

$$\nu_\mu = \frac{1}{1-\gamma} \left\| \frac{d^\star_\mu}{\mu} \right\|_\infty$$

is finite, i.e. $\nu_\mu < \infty$.

Assumption 4.5 is a standard assumption in the policy optimization literature (Agarwal et al., 2021; Xiao, 2022). As we will see in Theorem 4.7, the iteration complexity of Algorithm 1 depends polynomially on this term, meaning that the convergence rate is faster when the starting state distribution $\mu$ covers the whole state space.

Assumption 4.6. (Relative condition number) With respect to a distribution $v$, define $\Sigma_v = \mathbb{E}_{s,a\sim v}\big[\phi(s,a)\phi(s,a)^\top\big]$ and assume that there exists a $\kappa < \infty$ such that

$$\sup_{w\in\mathbb{R}^d} \frac{w^\top \Sigma_{d^\star} w}{w^\top \Sigma_\rho w} \le \kappa, \qquad \sup_{w\in\mathbb{R}^d} \frac{w^\top \Sigma_{d^t_\mu \bullet \mathrm{Unif}_{\mathcal{A}}} w}{w^\top \Sigma_\rho w} \le \kappa \quad \forall t \le T.$$

Assumption 4.6 is a standard assumption in the linear function approximation literature and highlights the importance of choosing a state-action distribution $\rho$ with good coverage over the feature space, as it can be enforced by choosing the appropriate $\rho$. In fact, if $\Phi = \{\phi(s,a) \mid s \in \mathcal{S}, a \in \mathcal{A}\}$ is a compact set, there always exists a state-action distribution $\rho$ such that $\kappa \le d$ (see Lemma 23 in Agarwal et al. (2021)). In general, if $\|\phi(s,a)\|_2^2 \le B$ for all $s \in \mathcal{S}, a \in \mathcal{A}$, we have the crude bound $\kappa \le B/\sigma_{\min}(\Sigma_\rho)$, where $\sigma_{\min}(M)$ is the minimum eigenvalue of a matrix $M$.

We are now ready to state the following theorem on the linear convergence of Algorithm 1.

Theorem 4.7. (Linear convergence of NPG with log-linear parametrization) Consider NPG as in Algorithm 1 and let Assumptions 4.1, 4.2, 4.5, and 4.6 hold. If the step-size schedule satisfies

$$\eta_{t+1} \ge \frac{\nu_\mu}{\nu_\mu - 1}\, \eta_t \quad \forall t, \qquad \text{and} \qquad \eta_0 \ge \frac{1-\gamma}{\gamma}\, \mathrm{KL}^\star_0,$$

then for every $T \ge 0$ we have

$$V^\star(\mu) - V^T(\mu) \le \Big(1 - \frac{1}{\nu_\mu}\Big)^T \frac{2}{1-\gamma} + \frac{2\nu_\mu \sqrt{|\mathcal{A}| \kappa\, \varepsilon_{\mathrm{stat}}}}{\sqrt{(1-\gamma)^3}} + \frac{2\nu_\mu \sqrt{|\mathcal{A}|\, \varepsilon_{\mathrm{bias}}}}{1-\gamma}.$$
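To get a feel for the shape of the guarantee in Theorem 4.7, the following minimal sketch builds a step-size schedule with growth factor $\nu_\mu/(\nu_\mu - 1)$ and evaluates the right-hand side of the bound. It assumes the error terms enter through square roots, as in Lemma 5.1, and the function names and all numerical inputs are hypothetical.

```python
import math

def step_sizes(eta0, nu, T):
    """Schedule eta_t = eta0 * (nu / (nu - 1))^t; requires nu > 1."""
    growth = nu / (nu - 1.0)
    return [eta0 * growth ** t for t in range(T)]

def theorem_bound(T, nu, gamma, n_actions, kappa, eps_stat, eps_bias):
    """Right-hand side of the bound: linear-rate term plus two error floors."""
    rate_term = (1.0 - 1.0 / nu) ** T * 2.0 / (1.0 - gamma)
    stat_floor = 2.0 * nu * math.sqrt(
        n_actions * kappa * eps_stat / (1.0 - gamma) ** 3)
    bias_floor = 2.0 * nu * math.sqrt(n_actions * eps_bias) / (1.0 - gamma)
    return rate_term + stat_floor + bias_floor

# Hypothetical problem constants, for illustration only.
nu, gamma, n_actions, kappa = 20.0, 0.9, 4, 2.0
etas = step_sizes(1.0, nu, 5)
print(etas)
for T in (0, 50, 100, 200):
    exact = theorem_bound(T, nu, gamma, n_actions, kappa, 0.0, 0.0)
    noisy = theorem_bound(T, nu, gamma, n_actions, kappa, 1e-6, 1e-6)
    print(T, exact, noisy)
```

With zero errors the bound decays geometrically at rate $1 - 1/\nu_\mu$, whereas with positive errors it flattens out at a floor that grows with $\nu_\mu$, $|\mathcal{A}|$ and $\kappa$.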
To the best of our knowledge, Theorem 4.7 represents the first result establishing linear convergence rates for NPG with unbounded updates and without entropy regularization for the log-linear policy class. The convergence rate has no explicit dependence on the cardinality of the state and action spaces, with the exception of the two $|\mathcal{A}|$ terms which, as already highlighted by Agarwal et al. (2021), can be removed with a path-dependent bound. In the case where $\varepsilon_{\mathrm{bias}} = 0$ and $\varepsilon_{\mathrm{stat}} = 0$, the theorem recovers the same convergence rate as Theorem 10 in Xiao (2022). To obtain the sample complexity of the algorithm, we take advantage of the theory for temporal difference learning developed by Telgarsky (2022). In particular, in order to satisfy Assumption 4.2 for all $T$ iterations with high probability we need

$$O\!\left( \frac{T \|w_t\|_2^2}{\varepsilon_{\mathrm{stat}} (1-\gamma)^3} \right)$$

samples. Combining this quantity with the iteration complexity from Theorem 4.7, we obtain a total sample complexity of

$$O\!\left( \frac{\|w_t\|_2^2}{\varepsilon^2 (1-\gamma)^{10}} \left\| \frac{d^\star_\mu}{\mu} \right\|_\infty \right).$$

Remark 4.8. (Different policy parametrizations) While our work focuses on the log-linear policy class, it is possible to extend our framework and our analysis to general function approximation schemes. Let $f_\theta : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ be a parametrized function and define the policy class $\{\pi_\theta \mid \theta \in \Theta\}$ as

$$\pi_\theta(a|s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'\in\mathcal{A}} \exp(f_\theta(s,a'))}.$$

Let $g_\omega$ be a parametrized operator of $f_\theta$, define the loss function

$$L(\omega, \theta, \nu) := \mathbb{E}_{s\sim\nu,\, a\sim\mathrm{Unif}_{\mathcal{A}}}\big[(Q^{\pi_\theta}(s,a) - g_\omega(f_\theta(s,a)))^2\big]$$

and let $\omega_t \in \operatorname{argmin}_\omega L(\omega, \theta_t, d^t_\mu)$. Then, Assumption 4.1 becomes

$$L(\omega_t, \theta_t, d^\star_\mu) \le \varepsilon_{\mathrm{bias}} \quad \forall t < T.$$

The NPG update (2) can then be formulated as requiring $\theta_{t+1}$ to be such that

$$f_{\theta_{t+1}}(s,a) = f_{\theta_t}(s,a) + \eta_t\, g_{\omega_t}(f_{\theta_t}(s,a)) \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}.$$

Finding methods to solve this system of equations is beyond the scope of this work.

5. ANALYSIS

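In order to prove Theorem 4.7, we need some intermediate results. The first one regards the decomposition of the statistical and the bias errors. Throughout this section, $\widehat{Q}^t_s = w_t^\top \phi(s,\cdot)$ denotes the estimate produced by the oracle and $\widetilde{Q}^t_s = (w^\star_t)^\top \phi(s,\cdot)$ denotes the best linear approximation of $Q^t_s$, where $w^\star_t$ is the minimizer in (5).

Lemma 5.1. The expected error of the estimate $\widehat{Q}^t_s$ of $Q^t_s$ can be bounded as follows

$$\mathbb{E}_{s\sim v} \langle Q^t_s - \widehat{Q}^t_s,\, \pi^t_s - \pi_s \rangle \le 2\sqrt{\frac{|\mathcal{A}| \kappa\, \varepsilon_{\mathrm{stat}}}{1-\gamma}} + 2\sqrt{|\mathcal{A}|\, \varepsilon_{\mathrm{bias}}} \quad \forall t < T,$$

for both $v = d^{t+1}_\mu$, $\pi_s = \pi^{t+1}_s$ and $v = d^\star_\mu$, $\pi_s = \pi^\star_s$.

Proof of Lemma 5.1. We start by adding and subtracting $\widetilde{Q}^t_s$:

$$\mathbb{E}_{s\sim v} \langle Q^t_s - \widehat{Q}^t_s,\, \pi^t_s - \pi_s \rangle \le \mathbb{E}_{s\sim v} \langle \widetilde{Q}^t_s - \widehat{Q}^t_s,\, \pi^t_s - \pi_s \rangle + \mathbb{E}_{s\sim v} \langle Q^t_s - \widetilde{Q}^t_s,\, \pi^t_s - \pi_s \rangle.$$

We then bound the two terms on the right-hand side separately. For the first term, we have that

$$\mathbb{E}_{s\sim v} \langle \widetilde{Q}^t_s - \widehat{Q}^t_s,\, \pi^t_s - \pi_s \rangle \le 2\sqrt{|\mathcal{A}|\, \mathbb{E}_{s\sim v,\, a\sim\mathrm{Unif}_{\mathcal{A}}}\big[(\widetilde{Q}^t(s,a) - \widehat{Q}^t(s,a))^2\big]} = 2\sqrt{|\mathcal{A}|\, \|w_t - w^\star_t\|^2_{\Sigma_{v \bullet \mathrm{Unif}_{\mathcal{A}}}}},$$

where $\|w\|^2_\Sigma := w^\top \Sigma w$. Using Assumption 4.6 and the fact that $(1-\gamma)\rho \le d^t_\rho$ we have

$$\|w_t - w^\star_t\|^2_{\Sigma_{v \bullet \mathrm{Unif}_{\mathcal{A}}}} \le \kappa\, \|w_t - w^\star_t\|^2_{\Sigma_\rho} \le \frac{\kappa}{1-\gamma}\, \|w_t - w^\star_t\|^2_{\Sigma_{d^t_\rho}} \le \frac{\kappa\, \varepsilon_{\mathrm{stat}}}{1-\gamma}.$$

Similarly, for the second term we have

$$\mathbb{E}_{s\sim v} \langle Q^t_s - \widetilde{Q}^t_s,\, \pi^t_s - \pi_s \rangle \le 2\sqrt{|\mathcal{A}|\, \mathbb{E}_{s\sim v,\, a\sim\mathrm{Unif}_{\mathcal{A}}}\big[(Q^t(s,a) - \widetilde{Q}^t(s,a))^2\big]} \le 2\sqrt{|\mathcal{A}|\, \varepsilon_{\mathrm{bias}}}.$$

For ease of exposition, in the rest of this section denote

$$\tau := 2\sqrt{\frac{|\mathcal{A}| \kappa\, \varepsilon_{\mathrm{stat}}}{1-\gamma}} + 2\sqrt{|\mathcal{A}|\, \varepsilon_{\mathrm{bias}}}.$$

The next lemma regards the quasi-monotonic improvements of Algorithm 1. Let $\widehat{Q}^t_s = w_t^\top \phi(s,\cdot)$.

Lemma 5.2. The updates of Algorithm 1 satisfy, for all $s \in \mathcal{S}$,

$$\langle \widehat{Q}^t_s,\, \pi^{t+1}_s - \pi^t_s \rangle \ge 0 \qquad \text{and} \qquad V^{t+1}(\mu) - V^t(\mu) \ge -\frac{\tau}{1-\gamma}.$$

Proof of Lemma 5.2. By 1-strong convexity of $h$ on $(0,1)$, we have $\forall s \in \mathcal{S}$

$$0 \le \|\pi^{t+1}_s - \pi^t_s\|_2^2 \le \langle \nabla h(\pi^{t+1}_s) - \nabla h(\pi^t_s),\, \pi^{t+1}_s - \pi^t_s \rangle = \eta_t \langle \widehat{Q}^t_s,\, \pi^{t+1}_s - \pi^t_s \rangle.$$

As to the second inequality, we use the performance difference lemma (1) and Lemma 5.1 to obtain

$$(1-\gamma)\big(V^{t+1}(\mu) - V^t(\mu)\big) = \mathbb{E}_{s\sim d^{t+1}_\mu} \langle Q^t_s,\, \pi^{t+1}_s - \pi^t_s \rangle = \mathbb{E}_{s\sim d^{t+1}_\mu} \langle \widehat{Q}^t_s,\, \pi^{t+1}_s - \pi^t_s \rangle + \mathbb{E}_{s\sim d^{t+1}_\mu} \langle Q^t_s - \widehat{Q}^t_s,\, \pi^{t+1}_s - \pi^t_s \rangle \ge -\tau,$$

and the claim follows by dividing both sides by $1-\gamma$.

The last result we need is the following lemma, which can be straightforwardly proven by induction.

Lemma 5.3. Suppose $0 < \alpha < 1$, $b > 0$ and a nonnegative sequence $\{a_k\}$ satisfies $a_{k+1} \le \alpha a_k + b$ $\forall k \ge 0$. Then for all $k \ge 0$,

$$a_k \le \alpha^k a_0 + \frac{b}{1-\alpha}.$$

With these results in place, we are ready to prove Theorem 4.7.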
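Proof of Theorem 4.7. Let $\nu_{t+1} = \left\| d^\star_\mu / d^{t+1}_\mu \right\|_\infty$ and consider the equality in (4) for the update of Algorithm 1, that is, with $\widehat{Q}^t_s$ in place of $Q^t_s$:

$$\mathrm{KL}(\pi_s, \pi^t_s) - \mathrm{KL}(\pi_s, \pi^{t+1}_s) - \mathrm{KL}(\pi^{t+1}_s, \pi^t_s) = -\eta_t \langle \widehat{Q}^t_s,\, \pi^{t+1}_s - \pi_s \rangle \quad \forall \pi_s.$$

Then, for $\pi_s = \pi^\star_s$, taking the expectation over $s \sim d^\star_\mu$ and dropping the nonnegative term $\mathrm{KL}(\pi^{t+1}_s, \pi^t_s)$, we have that

$$\eta_t \Big( \mathbb{E}_{s\sim d^\star_\mu} \langle \widehat{Q}^t_s,\, \pi^t_s - \pi^{t+1}_s \rangle + \mathbb{E}_{s\sim d^\star_\mu} \langle \widehat{Q}^t_s,\, \pi^\star_s - \pi^t_s \rangle \Big) \le \mathrm{KL}^\star_t - \mathrm{KL}^\star_{t+1}. \tag{6}$$

We bound the two terms on the left-hand side separately. For the first one, we have that

$$\mathbb{E}_{s\sim d^\star_\mu} \langle \widehat{Q}^t_s,\, \pi^t_s - \pi^{t+1}_s \rangle \ge \nu_{t+1}\, \mathbb{E}_{s\sim d^{t+1}_\mu} \langle \widehat{Q}^t_s,\, \pi^t_s - \pi^{t+1}_s \rangle$$
$$= \nu_{t+1} (1-\gamma) \big(V^t(\mu) - V^{t+1}(\mu)\big) + \nu_{t+1}\, \mathbb{E}_{s\sim d^{t+1}_\mu} \langle \widehat{Q}^t_s - Q^t_s,\, \pi^t_s - \pi^{t+1}_s \rangle$$
$$\ge \nu_{t+1} (1-\gamma) \big(V^t(\mu) - V^{t+1}(\mu)\big) - \nu_{t+1} \tau,$$

where the first inequality is due to Lemma 5.2, the equality is due to the performance difference lemma (1) and the second inequality is due to Lemma 5.1. We use Lemma 5.1 again to bound the second term in the left-hand side of (6):

$$\mathbb{E}_{s\sim d^\star_\mu} \langle \widehat{Q}^t_s,\, \pi^\star_s - \pi^t_s \rangle = \mathbb{E}_{s\sim d^\star_\mu} \langle Q^t_s,\, \pi^\star_s - \pi^t_s \rangle + \mathbb{E}_{s\sim d^\star_\mu} \langle \widehat{Q}^t_s - Q^t_s,\, \pi^\star_s - \pi^t_s \rangle \ge (1-\gamma) \big(V^\star(\mu) - V^t(\mu)\big) - \tau.$$

Plugging the two bounds in (6) and dividing by $(1-\gamma)\eta_t$ we obtain

$$\nu_{t+1} \Big( \Delta_{t+1} - \Delta_t - \frac{\tau}{1-\gamma} \Big) + \Delta_t \le \frac{\mathrm{KL}^\star_t}{(1-\gamma)\eta_t} - \frac{\mathrm{KL}^\star_{t+1}}{(1-\gamma)\eta_t} + \frac{\tau}{1-\gamma},$$

where $\Delta_t = V^\star(\mu) - V^t(\mu)$. From Lemma 5.2 we have that $\Delta_{t+1} - \Delta_t - \frac{\tau}{1-\gamma} \le 0$, so, since $\nu_{t+1} \le \nu_\mu$, we can replace $\nu_{t+1}$ with $\nu_\mu$ and write

$$\nu_\mu (\Delta_{t+1} - \Delta_t) + \Delta_t \le \frac{\mathrm{KL}^\star_t}{(1-\gamma)\eta_t} - \frac{\mathrm{KL}^\star_{t+1}}{(1-\gamma)\eta_t} + \frac{(1+\nu_\mu)\tau}{1-\gamma}.$$

Rearranging and dividing by $\nu_\mu$ we obtain

$$\Delta_{t+1} + \frac{\mathrm{KL}^\star_{t+1}}{(1-\gamma)\nu_\mu \eta_t} \le \Big(1 - \frac{1}{\nu_\mu}\Big) \Big( \Delta_t + \frac{\mathrm{KL}^\star_t}{(1-\gamma)\eta_t(\nu_\mu - 1)} \Big) + \Big(1 + \frac{1}{\nu_\mu}\Big) \frac{\tau}{1-\gamma}.$$

If the step sizes satisfy $\eta_{t+1}(\nu_\mu - 1) \ge \eta_t \nu_\mu$, then

$$\Delta_{t+1} + \frac{\mathrm{KL}^\star_{t+1}}{(1-\gamma)\eta_{t+1}(\nu_\mu - 1)} \le \Big(1 - \frac{1}{\nu_\mu}\Big) \Big( \Delta_t + \frac{\mathrm{KL}^\star_t}{(1-\gamma)\eta_t(\nu_\mu - 1)} \Big) + \frac{2\tau}{1-\gamma},$$

where we used that $\nu_\mu \ge 1$. The proof of the theorem follows by applying Lemma 5.3.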
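The final step of the proof applies Lemma 5.3 to a recursion of the form $a_{t+1} \le \alpha a_t + b$ with $\alpha = 1 - 1/\nu_\mu$ and $b = 2\tau/(1-\gamma)$. The following minimal sketch unrolls the worst case of such a recursion (with equality) and compares it with the bound $\alpha^k a_0 + b/(1-\alpha)$; the values of $\nu_\mu$, $\gamma$, $\tau$ and $a_0$ are hypothetical.

```python
def unroll(a0, alpha, b, steps):
    """Iterate a_{k+1} = alpha * a_k + b, the worst case allowed by the lemma."""
    seq = [a0]
    for _ in range(steps):
        seq.append(alpha * seq[-1] + b)
    return seq

# Hypothetical constants, for illustration only.
nu, gamma, tau = 20.0, 0.9, 0.01
alpha = 1.0 - 1.0 / nu
b = 2.0 * tau / (1.0 - gamma)
a0 = 2.0 / (1.0 - gamma)

steps = 200
seq = unroll(a0, alpha, b, steps)
bounds = [alpha ** k * a0 + b / (1.0 - alpha) for k in range(steps + 1)]

for k in (0, 10, 50, 200):
    print(k, seq[k], bounds[k])
print("error floor b / (1 - alpha):", b / (1.0 - alpha))
```

The sequence decays geometrically until it reaches the floor $b/(1-\alpha) = 2\nu_\mu\tau/(1-\gamma)$, which is how the error terms of the recursion turn into the additive terms of the final guarantee.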

6. CONCLUSION

We show how unregularized NPG can be tuned to achieve linear convergence for the log-linear policy class up to an error floor that depends on the statistical error of our estimates of Q t and the bias error of the best linear approximation of Q t . Our results fill the gap between the findings in the tabular setting and the log-linear policy setting, taking advantage of a mirror-descent type analysis, and address research directions outlined in previous works (Xiao, 2022) . The main future direction is that of extending our framework and results to general policy parametrizations, as we suggest in Remark 4.8.



Assumption 4.1. (Bias error) Define the loss function L(w, θ, v) := E s,a∼v Q π θ (s, a) -w ⊤ ϕ(s, a)

respect to the squared error averaged over the distribution d t ρ . The second quantity is the shift in distribution from d t ρ to d ⋆ in the loss function. A similar conclusion for Assumption 4.1 can be drawn for the the distribution d t+1

⊤ Σw. Using Assumption 4.6 and the fact that (1 -γ)ρ ≤ d t ρ we have ∥w

