CONVERGENCE RATE OF PRIMAL-DUAL APPROACH TO CONSTRAINED REINFORCEMENT LEARNING WITH SOFTMAX POLICY

Abstract

In this paper, we consider a primal-dual approach to solving constrained reinforcement learning (RL) problems, which we formulate as a constrained Markov decision process (CMDP). We propose the primal-dual policy gradient (PD-PG) algorithm with a softmax policy. Although constrained RL involves a non-concave maximization problem over the policy parameter space, we show that, for both exact policy gradient and model-free learning, the proposed PD-PG needs an iteration complexity of O(ε⁻²) to achieve the optimal policy with respect to both constraint and reward performance. This iteration complexity outperforms or matches that of most constrained RL algorithms. For learning with the exact policy gradient, the main challenge is to show that the positivity of the deterministic optimal policy (at the optimal action) is independent of both the state space and the number of iterations. For model-free learning, since we consider the discounted infinite-horizon setting and the simulator cannot roll out an infinite-horizon sequence, one of the main challenges is designing unbiased value function estimators from finite-horizon trajectories. We construct such unbiased estimators using trajectories whose horizons are drawn from a geometric distribution, which is the key technique for obtaining our theoretical results in the model-free setting.
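To illustrate the geometric-horizon technique mentioned above, the following sketch estimates a discounted infinite-horizon value from finite trajectories: sampling the horizon T with P(T = t) = (1 − γ)γᵗ and summing undiscounted rewards through step T yields an unbiased estimate of the discounted value, since P(T ≥ t) = γᵗ. The one-state MDP with constant reward 1 is a hypothetical toy example chosen so the true value 1/(1 − γ) is easy to check; it is not the paper's estimator verbatim.

```python
# Sketch of unbiased discounted-value estimation with geometric-distribution
# horizons (toy one-state MDP with constant reward; hypothetical example).
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)

def rollout_estimate():
    # Sample horizon T with P(T = t) = (1 - gamma) * gamma**t, t = 0, 1, ...
    # (numpy's geometric has support {1, 2, ...}, hence the shift by 1).
    T = rng.geometric(1.0 - gamma) - 1
    # Undiscounted reward sum over t = 0..T is unbiased for the discounted
    # value: E[sum] = sum_t r_t * P(T >= t) = sum_t gamma**t * r_t.
    return sum(1.0 for t in range(T + 1))  # r_t = 1 at every step here

estimates = [rollout_estimate() for _ in range(20000)]
print(np.mean(estimates))  # concentrates near 1 / (1 - gamma) = 10
```

Each rollout terminates after a random but almost-surely finite number of steps, which is what makes the estimator implementable with a finite-horizon simulator.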

1. INTRODUCTION

Reinforcement learning (RL) has achieved significant success in many fields (e.g., Silver et al., 2017; Vinyals et al., 2019; OpenAI, 2019). However, most RL algorithms improve performance under the assumption that the agent is free to explore any behavior, including behaviors that may be detrimental. For example, a robot agent should avoid actions that irrevocably harm its hardware (Deisenroth et al., 2013). It is therefore important to consider safe exploration, known as constrained RL (or safe RL), which is usually formulated as a constrained Markov decision process (CMDP) (Altman, 1999). The primal-dual approach (Altman, 1999; Bertsekas, 2014) is a fundamental way to solve CMDP problems, and it has recently been extended to policy gradient methods (e.g., Tessler et al., 2019; Petsagkourakis et al., 2020; Xu et al., 2021). However, most previous work focuses only on the natural policy gradient (NPG) (Kakade, 2002) for constrained RL (e.g., Ding et al., 2020; Xu et al., 2021; Zeng et al., 2021), and little is known about the vanilla policy gradient (Sutton et al., 2000) with the primal-dual approach to constrained RL. This gap raises the following foundational theoretical questions: (i) how can the primal-dual vanilla policy gradient method be applied to constrained RL with exact gradient information and with model-free learning? (ii) how fast does the primal-dual vanilla policy gradient converge to the optimal policy? (iii) what is the sample complexity of the primal-dual policy gradient? These questions are the focus of this paper, and we mainly consider the softmax policy for the discounted infinite-horizon CMDP with finite state and action spaces.

1.1. MAIN CONTRIBUTIONS

Constrained RL with Exact Policy Gradient. In Section 3, we propose a primal-dual policy gradient (PD-PG) algorithm, which improves reward performance via gradient ascent in the primal policy parameter space and enforces safe exploration via projected gradient descent in the dual space.
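The primal ascent / projected dual descent structure described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the 2-state, 2-action CMDP, the threshold b, and the step sizes are hypothetical, and the exact policy gradient is replaced by a finite-difference approximation for brevity. The Lagrangian maximized in the primal step is V_r(θ) + λ(V_c(θ) − b), and the dual variable λ is projected back onto [0, ∞) after each descent step.

```python
# Minimal PD-PG sketch on a hypothetical tabular CMDP with a softmax policy:
# gradient ascent on the policy parameters, projected gradient descent on lambda.
import numpy as np

gamma = 0.9
nS, nA = 2, 2
# P[s, a] is the next-state distribution; r = rewards, c = constraint utilities.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])
c = np.array([[0.0, 1.0], [1.0, 0.0]])
b = 2.0                                  # hypothetical constraint threshold
rho = np.array([0.5, 0.5])               # initial state distribution

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def value(theta, payoff):
    """Discounted value rho^T V under pi_theta, via (I - gamma * P_pi) V = payoff_pi."""
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    payoff_pi = (pi * payoff).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, payoff_pi)
    return rho @ V

def grad(f, theta, eps=1e-6):
    """Finite-difference gradient (a stand-in for the exact policy gradient)."""
    g = np.zeros_like(theta)
    for i in range(nS):
        for j in range(nA):
            e = np.zeros_like(theta); e[i, j] = eps
            g[i, j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

theta = np.zeros((nS, nA))
lam, eta_theta, eta_lam = 0.0, 1.0, 0.1
for t in range(500):
    # Primal step: gradient ascent on the Lagrangian V_r + lam * (V_c - b).
    L = lambda th: value(th, r) + lam * (value(th, c) - b)
    theta += eta_theta * grad(L, theta)
    # Dual step: gradient descent on lam, projected back onto [0, inf).
    lam = max(0.0, lam - eta_lam * (value(theta, c) - b))

print(value(theta, r), value(theta, c), lam)
```

The projection in the dual step keeps λ nonnegative; whenever the constraint value V_c falls below b, λ grows and shifts the primal objective toward constraint satisfaction.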

