LINEAR CONVERGENCE FOR NATURAL POLICY GRADIENT WITH LOG-LINEAR POLICY PARAMETRIZATION

Abstract

We analyze the convergence rate of the unregularized natural policy gradient algorithm with log-linear policy parametrizations in infinite-horizon discounted Markov decision processes. In the deterministic case, when the Q-value is known and can be approximated by a linear combination of a known feature function up to a bias error, we show that a geometrically-increasing step size yields a linear convergence rate towards an optimal policy. We then consider the sample-based case, when the best representation of the Q-value function among linear combinations of a known feature function is known up to an estimation error. In this setting, we show that the algorithm enjoys the same linear guarantees as in the deterministic case up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix. Our results build upon the general framework of policy mirror descent and extend previous findings for the softmax tabular parametrization to the log-linear policy class.

1. INTRODUCTION

Sequential decision-making represents a framework of paramount importance in modern statistics and machine learning. In this framework, an agent sequentially interacts with an environment to maximize a notion of cumulative reward. In these interactions, an agent observes its current state s ∈ S, takes an action a ∈ A according to a policy that associates to each state a probability distribution over actions, receives a reward, and transitions to a new state. Reinforcement Learning (RL) focuses on the case where the agent does not have complete knowledge of the environment dynamics. One of the most widely-used classes of algorithms for RL is policy optimization. In policy optimization algorithms, an agent iteratively updates a policy that belongs to a given parametrized class with the aim of maximizing the expected sum of discounted rewards, where the expectation is taken over the trajectories induced by the policy. Many types of policy optimization techniques have been explored in the literature, such as policy gradient methods (Sutton et al., 1999), natural policy gradient methods (Kakade, 2002), trust region policy optimization (Schulman et al.), and proximal policy optimization (Schulman et al., 2017). Thanks to the versatility of the policy parametrization framework, in particular the possibility of incorporating flexible approximation schemes such as neural networks, these methods have been successfully applied in many settings. However, a complete theoretical justification for the success of these methods is still lacking. The simplest and best understood setting for policy optimization is the tabular case, where both the state space S and the action space A are finite and the policy has a direct parametrization, i.e., it assigns a probability to each state-action pair. This setting has received a lot of attention in recent years and has seen several developments (Agarwal et al., 2021; Xiao, 2022).
Its analysis is particularly convenient due to the decoupled nature of the parametrization, where the probability distribution over actions that the policy assigns to each state can be updated and analyzed separately for each state. This decoupling simplifies the analysis, as it is often possible to drop discounted visitation distribution terms from the policy update and to exploit the contraction property typical of value and policy iteration methods. Recent results concern, in particular, natural policy gradient (NPG) and, more generally, policy mirror descent, and show that specific choices of learning rates yield linear convergence to the optimal policy for several formulations and variations of these algorithms (Cen et al., 2021; Zhan et al., 2021; Khodadadian et al.; Xiao, 2022; Li et al., 2022; Lan, 2022; Bhandari and Russo, 2021; Mei et al., 2020).
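To make the tabular setting described above concrete, the following is a minimal sketch, not taken from the paper, of tabular NPG with a softmax parametrization and geometrically increasing step sizes on a small randomly generated MDP. It uses the well-known multiplicative-weights form of the update, π_{t+1}(a|s) ∝ π_t(a|s) exp(η_t Q^{π_t}(s, a)) with η_t = η_0 (1/γ)^t, and exact policy evaluation; all constants and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random MDP: P[s, a] is a distribution over next states, rewards in [0, 1].
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
r = rng.uniform(size=(S, A))

def evaluate(pi):
    """Exact policy evaluation: V = (I - gamma P_pi)^{-1} r_pi, Q = r + gamma P V."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ V, V

# Optimal value function via value iteration, for reference.
V_star = np.zeros(S)
for _ in range(2000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

# NPG with softmax parametrization: accumulate eta_t * Q_t in the logits,
# i.e. pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(eta_t * Q_t(s, a)).
logits = np.zeros((S, A))
pi = np.full((S, A), 1.0 / A)
eta = 1.0
for t in range(100):
    Q, V = evaluate(pi)
    logits += eta * Q
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    eta /= gamma  # geometric growth: eta_t = eta_0 * (1 / gamma)**t

gap = np.max(V_star - evaluate(pi)[1])
print(f"optimality gap after 100 iterations: {gap:.2e}")
```

As the step size grows, the update behaves increasingly like a greedy policy-improvement step, which is the intuition behind the linear (geometric) convergence rates in the tabular analyses cited above.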

