CONVERGENCE RATE OF PRIMAL-DUAL APPROACH TO CONSTRAINED REINFORCEMENT LEARNING WITH SOFTMAX POLICY

Abstract

In this paper, we consider a primal-dual approach to solving constrained reinforcement learning (RL) problems, which we formulate as constrained Markov decision processes (CMDPs). We propose the primal-dual policy gradient (PD-PG) algorithm with a softmax policy. Although constrained RL involves a non-concave maximization problem over the policy parameter space, we show that, for both exact policy gradient and model-free learning, the proposed PD-PG needs an iteration complexity of $\mathcal{O}(\epsilon^{-2})$ to achieve its optimal policy with respect to both constraint and reward performance. Such an iteration complexity outperforms or matches most constrained RL algorithms. For learning with the exact policy gradient, the main challenge is to show that the probability the policy assigns to the optimal action is bounded below by a positive constant independent of both the state space and the iteration count. For model-free learning, since we consider the discounted infinite-horizon setting and the simulator cannot roll out an infinite-horizon sequence, one of the main challenges lies in how to design unbiased value function estimators with finite-horizon trajectories. We construct unbiased estimators from finite-horizon trajectories whose horizons are drawn from a geometric distribution, which is the key technique for obtaining our theoretical results for model-free learning.

1. INTRODUCTION

Reinforcement learning (RL) has achieved significant success in many fields (e.g., Silver et al., 2017; Vinyals et al., 2019; OpenAI, 2019). However, most RL algorithms improve performance under the assumption that an agent is free to explore any behavior (which may be detrimental). For example, a robot agent should avoid actions that irrevocably harm its hardware (Deisenroth et al., 2013). Thus, it is important to consider safe exploration, known as constrained RL (or safe RL), which is usually formulated as a constrained Markov decision process (CMDP) (Altman, 1999). The primal-dual approach (Altman, 1999; Bertsekas, 2014) is a fundamental way to solve CMDP problems. Recently, the primal-dual method has also been extended to policy gradient (e.g., Tessler et al., 2019; Petsagkourakis et al., 2020; Xu et al., 2021). However, most previous work focuses on natural policy gradient (NPG) (Kakade, 2002) for constrained RL (e.g., Ding et al., 2020; Xu et al., 2021; Zeng et al., 2021); little is known about the vanilla policy gradient (Sutton et al., 2000) with the primal-dual approach to constrained RL, which raises the following foundational theoretical questions: (i) how can the primal-dual vanilla policy gradient method be applied to constrained RL with exact information and with model-free learning? (ii) how fast does the primal-dual vanilla policy gradient converge to the optimal policy? (iii) what is the sample complexity of the primal-dual policy gradient? These questions are the focus of this paper, and we mainly consider the softmax policy for discounted infinite-horizon CMDPs with finite action and state spaces.

1.1. MAIN CONTRIBUTIONS

Constrained RL with Exact Policy Gradient. In Section 3, we propose a primal-dual policy gradient (PD-PG) algorithm, which improves reward performance via gradient ascent in the primal policy parameter space and enforces safe exploration via projected gradient descent in the dual space. Theorem 2 shows that PD-PG with the exact policy gradient needs an iteration complexity of
$$\mathcal{O}\left(\frac{\left\| d^{\pi^\star}_{\rho_0}/\rho_0 \right\|_\infty^{2}\, |S| \log|A|}{c\,(1-\gamma)^{4}\,\epsilon^{2}}\right) \qquad (1)$$
to obtain $\mathcal{O}(\epsilon)$-optimality, where $c$ is the infimum of the probability that the softmax policy assigns to the optimal action; $c$ is a positive scalar independent of both the time step $t$ and the state space $S$. One of the main challenges in obtaining the complexity (1) is to show that $c$ is bounded away from 0; see Proposition 2. From Table 1, we know the proposed PD-PG attains an iteration complexity of $\mathcal{O}(\epsilon^{-2})$, which is comparable to extensive constrained RL algorithms. Model-Free Constrained RL. In Section 4, we propose a sample-based PD-PG that only uses empirical data to learn a safe policy. The sample-based PD-PG needs a complexity of
$$\mathcal{O}\left(\frac{\left\| d^{\pi^\star}_{\rho_0}/\rho_0 \right\|_\infty^{2}\, |S|\left(|S||A| + m\right) \log|A|}{c\,(1-\gamma)^{4}\,\epsilon^{2}}\right) \qquad (2)$$
to obtain $\mathcal{O}(\epsilon)$-optimality, where $m$ is the number of constraints. The iteration complexity (2) outperforms or matches extensive existing state-of-the-art constrained RL algorithms; see Table 1. Since this work considers discounted infinite-horizon CMDPs and the simulator cannot roll out an infinite-horizon sequence, the main challenge lies in designing unbiased value function estimators with finite-horizon trajectories. In Section 4.2, following Paternain (2018, Chapter 6), we introduce unbiased estimators based on finite-horizon trajectories whose horizons are drawn from a geometric distribution, which plays a critical role in obtaining the iteration complexity of the sample-based PD-PG.
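The geometric-horizon trick behind the unbiased estimators admits a compact sketch. If the horizon satisfies $H \sim \mathrm{Geometric}(1-\gamma)$ on $\{0, 1, \ldots\}$, so that $\Pr(H \geq t) = \gamma^t$, then the *undiscounted* reward sum up to $H$ is unbiased for the discounted value, since $\mathbb{E}[\sum_{t=0}^{H} r_t] = \sum_t \Pr(H \geq t)\,\mathbb{E}[r_t] = \sum_t \gamma^t\,\mathbb{E}[r_t]$. The following is a minimal illustration of this identity, not the paper's estimator; the one-step simulator `step` is a hypothetical stand-in:

```python
import numpy as np

def geometric_horizon_return(step, s0, gamma, rng):
    """Unbiased estimate of the discounted value V(s0): draw a horizon
    H ~ Geometric(1 - gamma) with support {0, 1, ...}, so P(H >= t) = gamma^t,
    and return the undiscounted reward sum over t = 0, ..., H.
    `step(s, rng) -> (next_state, reward)` is a one-step simulator."""
    # numpy's geometric is supported on {1, 2, ...}; shift it to start at 0
    H = rng.geometric(1.0 - gamma) - 1
    s, total = s0, 0.0
    for _ in range(H + 1):
        s, reward = step(s, rng)
        total += reward
    return total

# Sanity check on a one-state chain with constant reward 1:
# the true discounted value is 1 / (1 - gamma).
def step(s, rng):
    return s, 1.0

rng = np.random.default_rng(0)
gamma = 0.9
est = np.mean([geometric_horizon_return(step, 0, gamma, rng)
               for _ in range(20000)])
```

Averaging many such truncated rollouts recovers $1/(1-\gamma) = 10$ here, at the price of a variance that also scales with the horizon distribution.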
Finally, in Section 4.6, we also illustrate an iteration complexity trade-off between PD-PG and NPD-PG (Ding et al., 2020), which we analyze through the trade-off between the distribution mismatch coefficient $\left\| d^{\pi}_{\rho_0}/\rho_0 \right\|_\infty$ (appearing in the proposed PD-PG) and the Moore-Penrose pseudoinverse of the Fisher information matrix, $F^{\dagger}(\theta)$ (appearing in NPD-PG (Ding et al., 2020)).
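The primal-dual scheme sketched in this section (gradient ascent on the policy parameters of the Lagrangian, projected gradient descent on the multiplier) can be illustrated on a toy problem. The example below is a hypothetical single-state CMDP (a two-armed bandit with a made-up reward/utility table and arbitrary step sizes), not the paper's Algorithm 1:

```python
import numpy as np

# Hypothetical single-state CMDP: action 0 earns reward 1 but no constraint
# utility; action 1 earns reward 0.5 and utility 1.  Constraint: E_pi[g] >= b.
r = np.array([1.0, 0.5])   # reward signal
g = np.array([0.0, 1.0])   # constraint utility
b = 0.5                    # constraint threshold

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def pd_pg(T=50000, eta_theta=0.01, eta_lam=0.01):
    """Exact-gradient primal-dual sketch: ascend the softmax parameters on the
    Lagrangian L(theta, lam) = E_pi[r] + lam * (E_pi[g] - b), and run projected
    gradient descent on the dual variable lam >= 0."""
    theta = np.zeros(2)
    lam = 0.0
    avg_pi = np.zeros(2)   # averaged policy, the usual primal-dual output
    for _ in range(T):
        pi = softmax(theta)
        q = r + lam * g                      # Lagrangian payoff per action
        # exact softmax policy gradient: pi(a) * (q(a) - E_pi[q])
        theta += eta_theta * pi * (q - pi @ q)
        # projected dual step on the (signed) constraint violation
        lam = max(0.0, lam - eta_lam * (pi @ g - b))
        avg_pi += pi
    return avg_pi / T, lam

avg_pi, lam = pd_pg()
```

The last iterate of such primal-dual dynamics can oscillate around the saddle point, which is why the sketch reports the averaged policy; here the average puts roughly half its mass on each arm, satisfying the constraint while keeping the reward near its constrained optimum.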

1.2. RELATED WORK

Constrained RL with Exact Policy Gradient. The proposed PD-PG (Algorithm 1) is a Lagrangian-based CMDP algorithm (Borkar, 2005; Bhatnagar & Lakshmanan, 2012; Liang et al., 2018; Tessler et al., 2019; Yu et al., 2019; Chow et al., 2017; Koppel et al., 2019; Miryoosefi et al., 2019; Paternain et al., 2019a;b). However, those works only provide asymptotic convergence results. The primal-dual method has been combined with policy gradient (e.g., Borkar, 2005; Bhatnagar & Lakshmanan, 2012; Tessler et al., 2019; Petsagkourakis et al., 2020; Wachi et al., 2021), but those works focus on natural policy gradient (NPG) with Fisher information (Kakade, 2002) or regularized policy iteration (e.g., Bharadhwaj et al., 2021) to solve constrained RL problems. Little is known about vanilla policy gradient (Williams, 1992; Sutton et al., 2000) with the primal-dual approach (i.e., the proposed PD-PG) for constrained RL. This work studies the finite-sample performance of the vanilla PD-PG. From Table 1 we see that, except for UCBVI-γ (He et al., 2021), which outperforms PD-PG by a factor of $\frac{1}{1-\gamma}$, PD-PG is comparable to extensive existing state-of-the-art CMDP algorithms.

Model-Free Constrained RL. Model-free constrained RL algorithms, including CPO (Achiam et al., 2017), IPO (Liu et al., 2020), Lyapunov-based safe RL (Chow et al., 2018), SAILR (Wagener et al., 2021), SPRL (Sohn et al., 2021), SNO-MDP (Wachi & Sui, 2020), A-CRL (Calvo-Fullana et al., 2021), and DCRL (Qin et al., 2021), all lack convergence rate analysis. Recently, Ding et al. (2020) propose the natural policy gradient primal-dual (NPD-PG) method for solving discounted infinite-horizon CMDPs. Even though the underlying maximization involves a non-concave objective function and a non-convex constraint set under the softmax policy parametrization, Ding et al. (2020) show that NPD-PG converges at sublinear rates with respect to both the optimality gap and the constraint violation, which shares a similar iteration complexity with the proposed PD-PG. Later, Zeng et al. (2021) extend the key idea of NPD-PG, propose an online version of NPD-PG, and show that their algorithm needs a sample complexity of $\mathcal{O}(\epsilon^{-6})$. Xu et al. (2021) propose a primal-type algorithmic framework to solve SRL problems and show that it needs $\mathcal{O}(\epsilon^{-4})$ sample complexity to obtain $\mathcal{O}(\epsilon)$-optimality. Finally, from Table 1 we see that PD-PG achieves the best sample complexity among the policy-based safe RL algorithms.

