LINEAR CONVERGENCE OF NATURAL POLICY GRADIENT METHODS WITH LOG-LINEAR POLICIES

Abstract

We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as inexact versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and Õ(1/ε²) sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size.

1. INTRODUCTION

Policy gradient (PG) methods have emerged as a popular class of algorithms for reinforcement learning. Unlike classical methods based on (approximate) dynamic programming (e.g., Puterman, 1994; Sutton & Barto, 2018), PG methods directly update the policy parameters along the gradient direction of the value function (e.g., Williams, 1992; Sutton et al., 2000; Baxter & Bartlett, 2001). An important variant of PG is the natural policy gradient (NPG) method (Kakade, 2001). NPG uses the Fisher information matrix of the policy distribution as a preconditioner to improve the policy gradient direction, in a manner similar to quasi-Newton methods in classical optimization. Variants of NPG with policy parametrization through deep neural networks have shown impressive empirical success (Schulman et al., 2015; Lillicrap et al., 2016; Mnih et al., 2016; Schulman et al., 2017).

Motivated by the success of NPG in practice, there is now a concerted effort to develop convergence theories for the NPG method. Neu et al. (2017) provide the first interpretation of NPG as a mirror descent (MD) method (Nemirovski & Yudin, 1983; Beck & Teboulle, 2003). By leveraging different techniques for analyzing MD, it has been established that NPG converges to the global optimum in the tabular case (Agarwal et al., 2021; Khodadadian et al., 2021b; Xiao, 2022) and in some more general settings (Shani et al., 2020; Vaswani et al., 2022; Grudzien et al., 2022). To obtain a fast linear convergence rate for NPG, several recent works consider regularized NPG methods, such as entropy-regularized NPG (Cen et al., 2021) and other convex regularized NPG methods (Lan, 2022; Zhan et al., 2021). By designing appropriate step sizes, Khodadadian et al. (2021b) and Xiao (2022) obtain linear convergence of NPG without regularization. However, all these linear convergence results are limited to the tabular setting with direct parametrization.
It remains unclear whether the same linear convergence rate can be established in function approximation settings. In this paper we provide an affirmative answer to this question for the log-linear policy class. Our approach is based on the framework of compatible function approximation (Sutton et al., 2000; Kakade, 2001), which was extensively developed by Agarwal et al. (2021). Using this framework, variants of NPG with log-linear policies can be written as policy mirror descent (PMD) methods with inexact evaluations of the advantage function or Q-function (giving rise to NPG or Q-NPG, respectively). Then, by extending a recent analysis of PMD (Xiao, 2022), we obtain non-asymptotic linear convergence of both NPG and Q-NPG with log-linear policies. A distinctive feature of this approach is the use of a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other (strongly) convex regularization. The extensions are highly nontrivial and require quite different techniques. This linear convergence leads to Õ(1/ε²) sample complexities for both methods. In particular, our sample complexity analysis also fixes errors in previous work. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size. See Appendix A for a thorough review. In particular, Table 1 provides a complete overview of our results.
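To make the geometrically increasing step-size schedule concrete, the following minimal sketch runs tabular NPG with exact policy evaluation; in this setting the NPG/PMD step has the closed multiplicative-weights form π⁺_{s,a} ∝ π_{s,a} exp(−η Q_{s,a}(π)) (costs are minimized). The random MDP, the initial step size η₀, and the iteration count are hypothetical choices for illustration only, not prescriptions from the analysis.

```python
import numpy as np

# Tabular NPG / KL-based policy mirror descent with a geometrically
# increasing step size eta_k = eta_0 / gamma^k. The random MDP, eta_0,
# and the iteration count below are hypothetical, for illustration only.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = P(s' | s, a)
c = rng.random((nS, nA))                        # costs in [0, 1]

def evaluate(pi):
    """Exact policy evaluation: returns (Q, V) for policy pi."""
    c_pi = (pi * c).sum(axis=1)                 # expected cost per state
    P_pi = np.einsum("sa,sat->st", pi, P)       # state transition kernel under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, c_pi)
    return c + gamma * P @ V, V

pi = np.full((nS, nA), 1.0 / nA)                # start from the uniform policy
_, V0 = evaluate(pi)
eta = 0.5                                       # eta_0 (hypothetical choice)
for k in range(60):
    Q, _ = evaluate(pi)
    # Closed-form NPG/PMD step: pi <- pi * exp(-eta * Q), renormalized per state
    # (shifting Q by its row minimum only changes the normalization constant).
    pi = pi * np.exp(-eta * (Q - Q.min(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)
    eta /= gamma                                # geometric step-size increase

Q, V = evaluate(pi)
assert np.all(V <= V0 + 1e-9)                   # monotone improvement (lower cost)
assert np.allclose(V, Q.min(axis=1), atol=1e-3) # final policy is near-greedy
```

With a constant step size the same loop still improves monotonically but only at a sublinear rate; the geometric increase is what drives the linear rate in the analysis.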

2. PRELIMINARIES ON MARKOV DECISION PROCESSES

We consider an MDP denoted as M = {S, A, P, c, γ}, where S is a finite state space, A is a finite action space, P : S × A → ∆(S) is a Markovian transition model with P(s′ | s, a) being the probability of transitioning from state s to state s′ under action a, c is a cost function with c(s, a) ∈ [0, 1] for all (s, a) ∈ S × A, and γ ∈ [0, 1) is a discount factor. Here we use cost instead of reward to better align with the minimization convention in the optimization literature. The agent's behavior is modeled as a stochastic policy π ∈ ∆(A)^{|S|}, where π_s ∈ ∆(A) is the probability distribution over the actions A in state s ∈ S. At each time t, the agent takes an action a_t ∈ A given the current state s_t ∈ S, following the policy π, i.e., a_t ∼ π_{s_t}. The MDP then transitions into the next state s_{t+1} with probability P(s_{t+1} | s_t, a_t) and the agent incurs the cost c_t = c(s_t, a_t). Thus, a policy induces a distribution over trajectories {s_t, a_t, c_t}_{t≥0}. In the infinite-horizon discounted setting, the cost function of π with an initial state s is defined as

    V_s(π) def= E_{a_t∼π_{s_t}, s_{t+1}∼P(·|s_t,a_t)} [ ∑_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s ].

Given an initial state distribution ρ ∈ ∆(S), the goal of the agent is to find a policy π that minimizes

    V_ρ(π) def= E_{s∼ρ}[V_s(π)] = ∑_{s∈S} ρ_s V_s(π) = ⟨V(π), ρ⟩.

A more granular characterization of the performance of a policy is the state-action cost function (Q-function). For any pair (s, a) ∈ S × A, it is defined as

    Q_{s,a}(π) def= E_{a_t∼π_{s_t}, s_{t+1}∼P(·|s_t,a_t)} [ ∑_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s, a_0 = a ].    (2)

Let Q_s(π) ∈ R^{|A|} denote the vector [Q_{s,a}(π)]_{a∈A}. Then we have V_s(π) = E_{a∼π_s}[Q_{s,a}(π)] = ⟨π_s, Q_s(π)⟩. The advantage function¹ is a centered version of the Q-function: A_{s,a}(π) def= Q_{s,a}(π) − V_s(π), which satisfies E_{a∼π_s}[A_{s,a}(π)] = 0 for all s ∈ S.

Visitation probabilities.
Given a starting state distribution ρ ∈ ∆(S), we define the state visitation distribution d^π(ρ) ∈ ∆(S), induced by a policy π, as

    d^π_s(ρ) def= (1 − γ) E_{s_0∼ρ} [ ∑_{t=0}^∞ γ^t Pr^π(s_t = s | s_0) ],

where Pr^π(s_t = s | s_0) is the probability that s_t equals s along the trajectory generated by π starting from s_0. Intuitively, the state visitation distribution measures the discounted probability of being at state s across the entire trajectory. We define the state-action visitation distribution d^π(ρ) ∈ ∆(S × A) as

    d^π_{s,a}(ρ) def= d^π_s(ρ) π_{s,a} = (1 − γ) E_{s_0∼ρ} [ ∑_{t=0}^∞ γ^t Pr^π(s_t = s, a_t = a | s_0) ].

In addition, we extend this definition by specifying an initial state-action distribution ν ∈ ∆(S × A), i.e.,

    d̄^π_{s,a}(ν) def= (1 − γ) E_{(s_0,a_0)∼ν} [ ∑_{t=0}^∞ γ^t Pr^π(s_t = s, a_t = a | s_0, a_0) ].

The difference between the last two definitions is that in the former the initial action a_0 is sampled directly from π, whereas in the latter it is prescribed by the initial state-action distribution ν. We write d̄ instead of d to better distinguish the case with ν from that with ρ. When they are clear from context, we omit the arguments ν and ρ to simplify the presentation. From these definitions, we have for all (s, a) ∈ S × A,

    d^π_s ≥ (1 − γ) ρ_s,   d^π_{s,a} ≥ (1 − γ) ρ_s π_{s,a},   d̄^π_{s,a} ≥ (1 − γ) ν_{s,a}.    (5)

¹ An advantage function should measure how much better a is compared to π, while here A_{s,a} is positive when a is worse than π. We keep calling A the advantage function to better align with the convention in the RL literature.
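As a numerical sanity check of the definitions above, the following sketch evaluates V, Q, and the state visitation distribution exactly on a small randomly generated MDP (all sizes and seeds are hypothetical, for illustration only), verifying the identity V_s = ⟨π_s, Q_s(π)⟩, the zero-mean property of the advantage function, and the lower bounds in (5).

```python
import numpy as np

# Hypothetical small MDP, used only to check the definitions numerically.
nS, nA, gamma = 3, 2, 0.8
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = P(s' | s, a)
c = rng.random((nS, nA))                        # costs in [0, 1]
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, a], a stochastic policy
rho = rng.dirichlet(np.ones(nS))                # initial state distribution

# Exact policy evaluation: V solves the Bellman equation V = c_pi + gamma * P_pi V.
c_pi = (pi * c).sum(axis=1)                     # c_pi[s] = E_{a~pi_s} c(s, a)
P_pi = np.einsum("sa,sat->st", pi, P)           # P_pi[s, s'] = E_{a~pi_s} P(s' | s, a)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, c_pi)
Q = c + gamma * P @ V                           # Q[s, a] = c(s, a) + gamma E[V(s_1)]
A = Q - V[:, None]                              # advantage (cost convention)

assert np.allclose((pi * Q).sum(axis=1), V)     # V_s = <pi_s, Q_s(pi)>
assert np.allclose((pi * A).sum(axis=1), 0.0)   # E_{a~pi_s}[A_{s,a}] = 0

# State visitation distribution in closed form:
# d^pi(rho) = (1 - gamma) * rho^T (I - gamma * P_pi)^{-1}.
d = (1 - gamma) * rho @ np.linalg.inv(np.eye(nS) - gamma * P_pi)
d_sa = d[:, None] * pi                          # d^pi_{s,a} = d^pi_s * pi_{s,a}

assert np.isclose(d.sum(), 1.0)                 # d^pi(rho) is a distribution
assert np.all(d >= (1 - gamma) * rho - 1e-12)   # first bound in (5)
assert np.all(d_sa >= (1 - gamma) * rho[:, None] * pi - 1e-12)  # second bound in (5)
```

The closed form for d^π follows from exchanging the sum and expectation in its definition: the discounted occupancy of the Markov chain with kernel P_π is (1 − γ) ∑_t γ^t ρ^T P_π^t = (1 − γ) ρ^T (I − γ P_π)^{-1}.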

