LINEAR CONVERGENCE FOR NATURAL POLICY GRADIENT WITH LOG-LINEAR POLICY PARAMETRIZATION

Abstract

We analyze the convergence rate of the unregularized natural policy gradient algorithm with log-linear policy parametrizations in infinite-horizon discounted Markov decision processes. In the deterministic case, when the Q-value is known and can be approximated by a linear combination of a known feature function up to a bias error, we show that a geometrically-increasing step size yields a linear convergence rate towards an optimal policy. We then consider the sample-based case, when the best representation of the Q-value function among linear combinations of a known feature function is known up to an estimation error. In this setting, we show that the algorithm enjoys the same linear guarantees as in the deterministic case up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix. Our results build upon the general framework of policy mirror descent and extend previous findings for the softmax tabular parametrization to the log-linear policy class.

1. INTRODUCTION

Sequential decision-making represents a framework of paramount importance in modern statistics and machine learning. In this framework, an agent sequentially interacts with an environment to maximize a notion of cumulative reward. In each interaction, the agent observes its current state s ∈ S, takes an action a ∈ A according to a policy that associates to each state a probability distribution over actions, receives a reward, and transitions to a new state. Reinforcement Learning (RL) focuses on the case where the agent does not have complete knowledge of the environment dynamics. One of the most widely used classes of algorithms for RL is policy optimization. In policy optimization algorithms, an agent iteratively updates a policy that belongs to a given parametrized class with the aim of maximizing the expected sum of discounted rewards, where the expectation is taken over the trajectories induced by the policy. Many policy optimization techniques have been explored in the literature, such as policy gradient methods (Sutton et al., 1999), natural policy gradient methods (Kakade, 2002), trust region policy optimization (Schulman et al.), and proximal policy optimization (Schulman et al., 2017). Thanks to the versatility of the policy parametrization framework, in particular the possibility of incorporating flexible approximation schemes such as neural networks, these methods have been successfully applied in many settings. However, a complete theoretical justification for their success is still lacking. The simplest and best understood setting for policy optimization is the tabular case, where both the state space S and the action space A are finite and the policy has a direct parametrization, i.e., it assigns a probability to each state-action pair. This setting has received a lot of attention in recent years and has seen several developments (Agarwal et al., 2021; Xiao, 2022).
Its analysis is particularly convenient due to the decoupled nature of the parametrization, where the probability distribution over the action space that the policy assigns to each state can be updated and analyzed separately for each state. This leads to a simplified analysis, where it is often possible to drop discounted visitation distribution terms in the policy update and to take advantage of the contraction property typical of value and policy iteration methods. Recent results involve, in particular, natural policy gradient (NPG) and, more generally, policy mirror descent, showing how specific choices of learning rates yield linear convergence to the optimal policy for several formulations and variations of these algorithms (Cen et al., 2021; Zhan et al., 2021; Khodadadian et al.; Xiao, 2022; Li et al., 2022; Lan, 2022; Bhandari and Russo, 2021; Mei et al., 2020). Two of the main shortfalls of these methods are their computational and sample complexities, which depend polynomially on the cardinality of the state and action spaces, even in the case of linear convergence. Indeed, by design, these algorithms need to update at each iteration a parameter or a probability for every state-action pair, which has an operation cost proportional to |S||A|. Furthermore, in order to preserve linear convergence in the sample-based case, the aforementioned works assume that the worst-case estimate (in ℓ_∞ norm) of Q^π(s, a), the expected sum of discounted rewards starting from the state-action pair (s, a) and following a policy π, is exact up to a given error threshold. Without further assumptions, meeting this threshold requires a number of samples that depends polynomially on |S||A|. A promising approach for dealing with large and high-dimensional spaces that has recently been explored is to assume that the environment has a low-rank structure and, as a consequence, can be described or approximated by a lower-dimensional representation.
In particular, a popular framework is that of linear function approximation, which assumes that quantities of interest in the problem formulation, such as the transition probability (Linear MDPs) or the action-value function Q^π of a policy π, can be approximated by a linear combination of a known d-dimensional feature function ϕ : S × A → R^d up to a bias error ε_bias. This linear assumption reduces the dimensionality of the problem to that of the feature function. In this setting, many researchers have proposed methods to learn the best representation ϕ (Agarwal et al., 2020; Modi et al., 2021; Uehara et al., 2021; Zhang et al., 2022) and to exploit it to design efficient variations of the upper confidence bound (UCB) algorithm, for instance (Jin et al., 2020; Li et al., 2021; Wagenmaker et al., 2022). When applying the framework of linear approximation to policy optimization, researchers typically adopt the log-linear policy class, where a policy π_θ parametrized by θ ∈ R^d is defined as proportional to exp(θ^⊤ ϕ). For this policy class, several works have obtained improvements in terms of computational and sample complexity, as the policy update requires a number of operations that scales only with the feature dimension d, and the estimation assumption needed to retain convergence rates in the sample-based setting is weaker than its tabular counterpart. In fact, theoretical guarantees for these algorithms only assume the expectation of Q^π over a known distribution on the state and action spaces to be exact up to a statistical error ε_stat. In the linear function approximation setting, meeting this assumption typically requires a number of samples that is only a function of d and does not depend on |S| and |A| (Telgarsky, 2022). However, a complete understanding of the convergence rate of policy optimization methods in this setting is still missing.
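As a concrete illustration of the log-linear policy class just described, the following minimal sketch computes π_θ(·|s) ∝ exp(θ^⊤ ϕ(s, ·)) for a single state. It assumes NumPy, a small discrete action set, and hypothetical values for the parameter vector θ and the feature matrix; it is an illustration of the policy class, not of any algorithm in the paper.

```python
import numpy as np

def log_linear_policy(theta, phi_s):
    """Log-linear policy: pi_theta(a|s) proportional to exp(theta^T phi(s, a)).

    theta : (d,) parameter vector
    phi_s : (|A|, d) feature matrix, one row phi(s, a) per action a
    Returns a probability distribution over the |A| actions.
    """
    logits = phi_s @ theta
    logits = logits - logits.max()   # shift logits for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy example: d = 2 features, 3 actions in some state s (hypothetical values).
theta = np.array([1.0, -0.5])
phi_s = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
probs = log_linear_policy(theta, phi_s)
```

Note that the policy is stored and updated through the d-dimensional vector θ alone, which is the source of the computational savings over the tabular parametrization: the per-update cost scales with d rather than with |S||A|.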
Recent results include sublinear convergence rates for unregularized NPG (Agarwal et al., 2021; Qiu et al., 2021; Zanette et al., 2021; Hu et al., 2021) and linear convergence rates for entropy-regularized NPG with bounded updates (Cayci et al., 2021). Our work fills the gap between these findings and extends the analysis and results of the tabular setting to the linear function approximation setting. In particular, we show that, under the standard assumptions on the (ε_stat, ε_bias)-approximation of Q^π mentioned above, a choice of geometrically-increasing step sizes leads to linear convergence of NPG for the log-linear policy class in both the deterministic and sample-based settings. Our result directly improves upon the sublinear iteration complexity of NPG previously established for the log-linear policy class by Agarwal et al. (2021) and Hu et al. (2021), and it removes the need for the entropy regularization and bounded step sizes used by Cayci et al. (2021), under the same assumptions on the linear approximation of Q^π. Moreover, the number of operations needed for the policy update and the number of samples needed to preserve the convergence rate in the sample-based setting depend only on the dimension d of ϕ, as opposed to the tabular setting, where the same quantities depend on |S||A|. By extending the linear convergence rate of NPG from the tabular softmax parametrization to log-linear policy parametrizations, our result directly addresses the research direction outlined in the conclusion of Xiao (2022) and overcomes the aforementioned limitations of the tabular setting. Our analysis is based on the equivalence of NPG and policy mirror descent with the KL divergence (Raskutti and Mukherjee, 2015), which has been exploited to apply mirror-descent-type analyses to NPG by several works, such as Agarwal et al. (2021); Hu et al. (2021); Cayci et al. (2021). The advantages of this equivalence are twofold.
Firstly, NPG ensures a simple update rule, i.e., log π_{t+1}(a|s) = log π_t(a|s) + η_t Q^{π_t}(s, a), which in the particular case of the log-linear policy class translates into θ_{t+1}^⊤ ϕ(s, a) = θ_t^⊤ ϕ(s, a) + η_t Q^{π_t}(s, a). Secondly, the mirror descent setup is particularly useful to iteratively control the updates and the approximation errors,

