ROBUST CONSTRAINED REINFORCEMENT LEARNING

Abstract

Constrained reinforcement learning aims to maximize the reward subject to constraints on utilities/costs. In practice, however, the training environment often differs from the test one, due to, e.g., modeling error, adversarial attacks, or non-stationarity, resulting in severe performance degradation and, more importantly, constraint violation in the test environment. To address this challenge, we formulate the framework of robust constrained reinforcement learning under model uncertainty, where the MDP is not fixed but lies in some uncertainty set. The goal is twofold: 1) to guarantee that constraints on utilities/costs are satisfied for all MDPs in the uncertainty set, and 2) to maximize the worst-case reward performance over the uncertainty set. We design a robust primal-dual approach and develop theoretical guarantees on its convergence, complexity, and robust feasibility. We then investigate a concrete example of the δ-contamination uncertainty set, design an online and model-free algorithm, and theoretically characterize its sample complexity.

1. INTRODUCTION

In many practical reinforcement learning (RL) applications, it is critical for an agent to meet certain constraints on utilities/costs while maximizing the reward. This problem is usually modeled as a constrained Markov decision process (CMDP) (Altman, 1999). Consider a CMDP with state space S, action space A, transition kernel P = {p^a_s ∈ Δ_S : s ∈ S, a ∈ A}, reward and utility functions r, c_i : S × A → [0, 1], 1 ≤ i ≤ m, and discount factor γ. The goal of a CMDP is to find a stationary policy π : S → Δ_A that maximizes the expected reward subject to constraints on the utilities:

$$\max_{\pi\in\Pi}\ \mathbb{E}_{\pi,\mathsf{P}}\Big[\sum_{t=0}^{\infty}\gamma^t r(S_t,A_t)\,\Big|\,S_0\sim\rho\Big],\quad \text{s.t.}\ \mathbb{E}_{\pi,\mathsf{P}}\Big[\sum_{t=0}^{\infty}\gamma^t c_i(S_t,A_t)\,\Big|\,S_0\sim\rho\Big]\ge b_i,\ 1\le i\le m, \tag{1}$$

where ρ is the initial state distribution, the b_i's are thresholds, and E_{π,P} denotes the expectation when the agent follows policy π and the environment transits following P.

In practice, the environment on which the learned policy is deployed (the test environment) may deviate from the training one, due to, e.g., modeling error of the simulator, adversarial attacks, or non-stationarity. This can lead to a significant degradation in reward and, more importantly, the constraints may no longer be satisfied, which is severe in safety-critical applications. For example, a drone may run out of battery and crash due to a mismatch between the training and test environments. This motivates the study of robust constrained RL in this paper.

In this paper, we take a pessimistic approach in the face of uncertainty. Specifically, consider a set of transition kernels P, usually constructed so as to include the test environment with high probability (Iyengar, 2005; Nilim & El Ghaoui, 2004; Bagnell et al., 2001).
The learned policy should satisfy the constraints under all environments in P, i.e., for all P ∈ P, E_{π,P}[Σ_{t=0}^∞ γ^t c_i(S_t, A_t) | S_0 ∼ ρ] ≥ b_i, which is equivalent to

$$\min_{\mathsf{P}\in\mathcal{P}}\ \mathbb{E}_{\pi,\mathsf{P}}\Big[\sum_{t=0}^{\infty}\gamma^t c_i(S_t,A_t)\,\Big|\,S_0\sim\rho\Big]\ge b_i. \tag{2}$$

At the same time, we aim to optimize the worst-case reward performance over P:

$$\max_{\pi\in\Pi}\ \min_{\mathsf{P}\in\mathcal{P}}\ \mathbb{E}_{\pi,\mathsf{P}}\Big[\sum_{t=0}^{\infty}\gamma^t r(S_t,A_t)\,\Big|\,S_0\sim\rho\Big],\quad \text{s.t.}\ \min_{\mathsf{P}\in\mathcal{P}}\ \mathbb{E}_{\pi,\mathsf{P}}\Big[\sum_{t=0}^{\infty}\gamma^t c_i(S_t,A_t)\,\Big|\,S_0\sim\rho\Big]\ge b_i,\ 1\le i\le m. \tag{3}$$

(Footnote: Δ_X denotes the probability simplex supported on the set X.)

On one hand, a feasible solution to eq. (3) always satisfies eq. (2); on the other hand, the solution to eq. (3) provides a performance guarantee for any P ∈ P. We note that our approach and analysis can also be applied to the optimistic approach in the face of uncertainty. In this paper, we design and analyze a robust primal-dual algorithm for robust constrained RL. In particular, the technical challenges and our major contributions are as follows.

• We take the Lagrange multiplier method to solve the constrained policy optimization problem. A first question is whether the primal problem is equivalent to the dual problem, i.e., whether the duality gap is zero. For non-robust constrained RL, the Lagrangian has a zero duality gap (Paternain et al., 2019; Altman, 1999). However, we show that this is not necessarily true in the robust constrained setting. Convexity of the set of visitation distributions is a key property in showing the zero duality gap of constrained MDPs (Altman, 1999; Paternain et al., 2019); in this paper, we construct a novel counterexample showing that the set of robust visitation distributions for our robust problem is non-convex.

• In the dual problem of non-robust CMDPs, the sum of two value functions is itself a value function for the combined reward. This does not hold in the robust setting, since the worst-case transition kernels for the two robust value functions are not necessarily the same. Therefore, the geometry of our Lagrangian is much more complicated. We formulate the dual problem of robust constrained RL as a minimax linear-nonconcave optimization problem, show that the optimal dual variable is bounded, and construct a robust primal-dual algorithm that alternately updates the primal and dual variables.
We theoretically prove convergence to stationary points and characterize the complexity.

• In general, convergence to stationary points of the Lagrangian function does not necessarily imply that the solution is feasible (Lin et al., 2020; Xu et al., 2020). We design a novel proof showing that the gradient belongs to the normal cone of the feasible set, based on which we further prove the robust feasibility of the obtained policy.

• We apply and extend our results to an important uncertainty set referred to as the δ-contamination model (Huber, 1965). Under this model, the robust value functions are not differentiable, so we propose a smoothed approximation of the robust value function with better geometry. We further investigate the practical online and model-free setting and design an actor-critic type algorithm, for which we also establish convergence, sample complexity, and robust feasibility.

We now discuss work related to robust constrained RL.

Robust constrained RL. In (Russel et al., 2020), the robust constrained RL problem was studied and a heuristic approach was developed. The basic idea is to estimate the robust value functions and then use the vanilla policy gradient method (Sutton et al., 1999) with the vanilla value function replaced by the robust value function. However, this approach does not take into account that the worst-case transition kernel is itself a function of the policy (see Section 3.1 in (Russel et al., 2020)), and therefore the "gradient" therein is not actually the gradient of the robust value function. Thus, its performance and convergence cannot be theoretically guaranteed. Another work (Mankowitz et al., 2020) studied the same robust constrained RL problem in the continuous control setting and proposed a similar heuristic algorithm: a robust Bellman operator is used to estimate the robust value function, which is then combined with a non-robust continuous control algorithm to update the policy. Both approaches in (Russel et al., 2020) and (Mankowitz et al., 2020) inherit the heuristic structure of "robust policy evaluation" + "non-robust vanilla policy improvement", which does not necessarily yield an improved policy in general. In this paper, we employ a "robust policy evaluation" + "robust policy improvement" approach, which guarantees policy improvement; more importantly, we provide theoretical convergence guarantees, robust feasibility guarantees, and complexity analyses for our algorithms.

Constrained RL. The most commonly used method for constrained RL is the primal-dual method (Altman, 1999; Paternain et al., 2019; 2022; Liang et al., 2018; Stooke et al., 2020; Tessler et al., 2018; Yu et al., 2019; Zheng & Ratliff, 2020; Efroni et al., 2020; Auer et al., 2008), which augments the objective with a sum of constraints weighted by their corresponding Lagrange multipliers and then alternately updates the primal and dual variables. It was shown that strong duality holds for constrained RL, and hence the primal-dual method has zero duality gap (Paternain et al., 2019; Altman, 1999). The convergence rate of the primal-dual method was investigated in (Ding et al., 2020; 2021; Li et al., 2021b; Liu et al., 2021; Ying et al., 2021). Another class of methods is primal methods, which enforce the constraints without resorting to the Lagrangian formulation (Achiam et al., 2017; Liu et al., 2020; Chow et al., 2018; Dalal et al., 2018; Xu et al., 2021; Yang et al., 2020). These studies, when directly applied to robust constrained RL, cannot guarantee the constraints under model deviation. Moreover, the objective and constraints in this paper take a minimum over the uncertainty set (see eq. (4)) and therefore have a much more complicated geometry than in the non-robust case.

Robust RL under model uncertainty. Model-based robust RL was first introduced and studied in (Iyengar, 2005; Nilim & El Ghaoui, 2004; Bagnell et al., 2001; Satia & Lave Jr, 1973; Wiesemann et al., 2013; Lim & Autef, 2019; Xu & Mannor, 2010; Yu & Xu, 2015; Lim et al., 2013; Tamar et al., 2014), where the uncertainty set is assumed to be known and the problem can be solved using robust dynamic programming. It was later extended to the model-free setting, where the uncertainty set is unknown and only samples from its centroid can be collected (Roy et al., 2017; Wang & Zou, 2021; 2022; Zhou et al., 2021; Yang et al., 2021; Panaganti & Kalathil, 2021; Ho et al., 2018; 2021). There are also empirical studies of robust RL, e.g., (Vinitsky et al., 2020; Pinto et al., 2017; Abdullah et al., 2019; Hou et al., 2020; Rajeswaran et al., 2017; Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017; Pattanaik et al., 2018; Mandlekar et al., 2017). These works focus on robust RL without constraints, whereas we investigate robust RL with constraints, which is more challenging. There is also a related line of work on (robust) imitation learning (Ho & Ermon, 2016; Fu et al., 2017; Torabi et al., 2018; Viano et al., 2022), which can be formulated as a constrained problem, but both the settings and the approaches are fundamentally different from ours.
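The pessimistic formulation in eqs. (1)-(3) can be illustrated numerically. The following is a minimal sketch (not the paper's algorithm): for a toy CMDP, we evaluate a fixed policy under each kernel in a finite uncertainty set, take the minimum as in eq. (3), and check the robust constraint of eq. (2). All function names and the toy numbers are our own illustrative choices.

```python
def policy_value(P, rew, pi, gamma, iters=2000):
    """Evaluate V^{pi,P}(s) by repeatedly applying the Bellman operator."""
    nS, nA = len(rew), len(rew[0])
    V = [0.0] * nS
    for _ in range(iters):
        V = [sum(pi[s][a] * (rew[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(nS)))
                 for a in range(nA)) for s in range(nS)]
    return V

def worst_case_return(kernels, rew, pi, gamma, rho):
    """min over a finite set of candidate kernels of E[sum_t gamma^t rew | S_0 ~ rho]."""
    vals = []
    for P in kernels:
        V = policy_value(P, rew, pi, gamma)
        vals.append(sum(rho[s] * V[s] for s in range(len(rho))))
    return min(vals)

# Toy 2-state, 2-action problem with two candidate kernels (train vs. perturbed).
r = [[1.0, 0.0], [0.0, 0.5]]          # reward r(s, a)
c = [[0.2, 1.0], [1.0, 0.3]]          # utility c(s, a)
P_train = [[[0.9, 0.1], [0.1, 0.9]], [[0.5, 0.5], [0.2, 0.8]]]
P_test  = [[[0.6, 0.4], [0.3, 0.7]], [[0.7, 0.3], [0.4, 0.6]]]
pi = [[0.5, 0.5], [0.5, 0.5]]         # uniform policy
rho = [1.0, 0.0]
gamma = 0.9

worst_r = worst_case_return([P_train, P_test], r, pi, gamma, rho)
worst_c = worst_case_return([P_train, P_test], c, pi, gamma, rho)
b = 2.0
print("worst-case reward:", worst_r)
print("robust constraint satisfied for all P:", worst_c >= b)
```

The policy is robust feasible exactly when the worst-case utility return stays above the threshold b, which is the check in eq. (2).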

2. PRELIMINARIES

Constrained MDP. Consider the CMDP problem in eq. (1). Define the visitation distribution induced by policy π and transition kernel P:

$$d^{\pi}_{\rho,\mathsf{P}}(s,a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\, \mathbb{P}(S_t=s, A_t=a\,|\,S_0\sim\rho,\pi,\mathsf{P}).$$

It can be shown that the set of visitation distributions of all policies, {d^π_{ρ,P} ∈ Δ_{S×A} : π ∈ Π}, is convex (Paternain et al., 2022; Altman, 1999). Based on this convexity, strong duality of the CMDP can be established (Altman, 1999; Paternain et al., 2019) under a standard assumption referred to as Slater's condition (Bertsekas, 2014; Ding et al., 2021): there exist a constant ζ > 0 and a policy π ∈ Π s.t. V^π_{c_i,P} − b_i ≥ ζ for all i.

Robust MDP. In this paper, we focus on the (s, a)-rectangular uncertainty set (Nilim & El Ghaoui, 2004; Iyengar, 2005), i.e., P = ⊗_{s,a} P^a_s, where P^a_s ⊆ Δ_S. At each time step, the environment transits following a transition kernel P_t ∈ P. The robust value function of a policy π is defined as the worst-case expected accumulative discounted reward following π over all MDPs in the uncertainty set (Nilim & El Ghaoui, 2004; Iyengar, 2005):

$$V^{\pi}_{r,\mathcal{P}}(s) \triangleq \min_{\kappa=(\mathsf{P}_0,\mathsf{P}_1,\dots)\in\bigotimes_{t\ge 0}\mathcal{P}}\ \mathbb{E}_{\kappa}\Big[\sum_{t=0}^{\infty}\gamma^t r(S_t,A_t)\,\Big|\,S_0=s,\pi\Big], \tag{4}$$

where E_κ denotes the expectation when the state transits according to κ. It was shown that the robust value function is the fixed point of the robust Bellman operator (Nilim & El Ghaoui, 2004; Iyengar, 2005; Puterman, 2014):

$$\mathbf{T}^{\pi}V(s) \triangleq \sum_{a\in A}\pi(a|s)\Big(r(s,a) + \gamma\,\sigma_{\mathcal{P}^a_s}(V)\Big),$$

where σ_{P^a_s}(V) ≜ min_{p∈P^a_s} p^⊤V is the support function of V on P^a_s. Note that the minimizer of eq. (4), κ*, is stationary in time (Iyengar, 2005); we denote it by κ* = {P^π, P^π, ...} and refer to P^π as the worst-case transition kernel. The robust value function V^π_{r,P} is then the value function under policy π and transition kernel P^π.
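To make the robust Bellman operator concrete, the sketch below implements T^π for an (s, a)-rectangular set represented by a finite list of candidate rows per state-action pair (the support function then reduces to a minimum over those rows), and iterates it to a fixed point. The toy MDP and all names are our own; this is a sketch under those assumptions, not the paper's implementation.

```python
def support_min(vertices, V):
    """Support function sigma_P(V) = min_{p in P} p^T V, with P given by candidate rows."""
    return min(sum(p[s] * V[s] for s in range(len(V))) for p in vertices)

def robust_bellman(P_sets, rew, pi, V, gamma):
    """One application of the robust Bellman operator T^pi for an
    (s, a)-rectangular set: P_sets[s][a] is a list of candidate rows p_s^a."""
    nS, nA = len(rew), len(rew[0])
    return [sum(pi[s][a] * (rew[s][a] + gamma * support_min(P_sets[s][a], V))
                for a in range(nA)) for s in range(nS)]

# 2-state, 1-action example: each P_s^a has two candidate transition rows.
rew = [[1.0], [0.0]]
pi = [[1.0], [1.0]]
P_sets = [[[[0.8, 0.2], [0.5, 0.5]]],
          [[[0.1, 0.9], [0.4, 0.6]]]]
gamma = 0.9

# Fixed-point iteration converges to the robust value function.
V = [0.0, 0.0]
for _ in range(300):
    V = robust_bellman(P_sets, rew, pi, V, gamma)
print("robust value function:", V)
```

Because the min over rows is 1-Lipschitz in the sup norm, T^π is a γ-contraction, which is why the fixed-point iteration above converges geometrically.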
The goal of robust RL is to find the optimal robust policy π * that maximizes the worst-case accumulative discounted reward: π * = arg max π V π r,P (s), ∀s ∈ S.

3. ROBUST CONSTRAINED RL

Recall the robust constrained RL problem formulated in eq. (3):

$$\max_{\theta\in\Theta}\ V^{\pi_\theta}_{r}(\rho),\quad \text{s.t.}\ V^{\pi_\theta}_{c_i}(\rho)\ge b_i,\ 1\le i\le m, \tag{5}$$

where for simplicity we omit the subscript P in V^{π_θ}_{⋄,P} and denote by V^{π_θ}_{c_i}(ρ) and V^{π_θ}_r(ρ) the robust value functions for c_i and r under π_θ. The goal of eq. (5) is to find a policy that maximizes the robust reward value function among the feasible solutions. Any feasible solution to eq. (5) guarantees that, under any MDP in the uncertainty set, the accumulative discounted utility is no less than b_i, which ensures robustness to constraint violation under model uncertainty. Furthermore, the optimal solution to eq. (5) achieves the best worst-case reward performance among all feasible solutions: if we use the optimal solution to eq. (5), then under any MDP in the uncertainty set, the reward is guaranteed to be no less than the optimal value of eq. (5).

In this paper, we focus on a parameterized policy class, i.e., π_θ ∈ Π_Θ, where Θ ⊆ R^d is a parameter set and Π_Θ is a class of parameterized policies, e.g., direct parameterization, softmax, or neural network policies. For technical convenience, we adopt a standard assumption on the policy class.

Assumption 1. The policy class Π_Θ is k-Lipschitz and l-smooth, i.e., for any s ∈ S, a ∈ A, and θ ∈ Θ, there exist universal constants k, l such that ∥∇π_θ(a|s)∥ ≤ k and ∥∇²π_θ(a|s)∥ ≤ l.

This assumption is satisfied by many policy classes, e.g., direct parameterization (Agarwal et al., 2021), softmax (Mei et al., 2020; Li et al., 2021a; Wang & Zou, 2020), and neural networks with Lipschitz and smooth activation functions (Du et al., 2019; Neyshabur, 2017; Miyato et al., 2018). The problem in eq. (5) is equivalent to the following max-min problem:

$$\max_{\theta\in\Theta}\ \min_{\lambda_i\ge 0}\ V^{\pi_\theta}_{r}(\rho) + \sum_{i=1}^{m}\lambda_i\big(V^{\pi_\theta}_{c_i}(\rho) - b_i\big). \tag{6}$$

Unlike the non-robust CMDP case, strong duality for robust constrained RL may not hold.
For robust RL, the robust value function can be viewed as the value function of policy π under its worst-case transition kernel P^π, and can therefore be written as the inner product between the reward (utility) function and the visitation distribution induced by π and P^π (referred to as the robust visitation distribution of π). The following lemma shows that the set of robust visitation distributions may not be convex; therefore, the approach used in (Altman, 1999; Paternain et al., 2019) to show strong duality cannot be applied here.

Lemma 1. There exists a robust MDP such that the set of robust visitation distributions is non-convex.

In the following, we focus on the dual problem of eq. (6). For simplicity, we consider the case with one constraint; the extension to multiple constraints is straightforward:

$$\min_{\lambda\ge 0}\ \max_{\theta\in\Theta}\ V^{\pi_\theta}_{r}(\rho) + \lambda\big(V^{\pi_\theta}_{c}(\rho) - b\big). \tag{7}$$

We make an assumption of Slater's condition, i.e., that there exists at least one strictly feasible policy (Bertsekas, 2014; Ding et al., 2021), under which we further show that the optimal dual variable of eq. (7) is bounded.

Assumption 2. There exist ζ > 0 and a policy π ∈ Π_Θ s.t. V^π_c(ρ) − b ≥ ζ.

Lemma 2. Denote the optimal solution of eq. (7) by (λ*, π_{θ*}). Then λ* ∈ [0, 2/(ζ(1−γ))].

Lemma 2 implies that the dual problem eq. (7) is equivalent to a bounded min-max problem:

$$\min_{\lambda\in[0,\,2/(\zeta(1-\gamma))]}\ \max_{\theta\in\Theta}\ V^{\pi_\theta}_{r}(\rho) + \lambda\big(V^{\pi_\theta}_{c}(\rho) - b\big). \tag{8}$$

The problem in eq. (8) is a bounded linear-nonconcave optimization problem. We propose our robust primal-dual (RPD) algorithm for robust constrained RL in Algorithm 1. The basic idea of Algorithm 1 is to perform gradient descent-ascent w.r.t. λ and θ alternately. When the policy π violates the constraint, the dual variable λ increases so that λV^π_c dominates V^π_r; gradient ascent then updates θ until the policy satisfies the constraint.
Therefore, this approach is expected to find a feasible policy (as will be shown in Lemma 5). Here, Proj_X(x) denotes the projection of x onto the set X, and {b_t} is a non-negative, monotonically decreasing sequence to be specified later. Algorithm 1 reduces to the vanilla gradient descent-ascent algorithm in (Lin et al., 2020) if b_t = 0; however, b_t is critical to the convergence of Algorithm 1 (Xu et al., 2020).

Algorithm 1 Robust Primal-Dual algorithm (RPD)
Input: T, α_t, β_t, b_t
Initialization: λ_0, θ_0
for t = 0, 1, ..., T − 1 do
    λ_{t+1} ← Proj_{[0,Λ*]}( λ_t − (1/β_t)( V^{π_{θ_t}}_c(ρ) − b ) − (b_t/β_t) λ_t )
    θ_{t+1} ← Proj_Θ( θ_t + (1/α_t)( ∇_θ V^{π_{θ_t}}_r(ρ) + λ_{t+1} ∇_θ V^{π_{θ_t}}_c(ρ) ) )
end for
Output: θ_T

The outer problem of eq. (8) is linear in λ, and after introducing b_t, the update of λ_t can be viewed as a gradient descent step on the strongly convex function λ(V^{π_θ}_c(ρ) − b) + (b_t/2)λ², which converges faster and more stably. Denote the Lagrangian function by V_L(θ, λ) ≜ V^{π_θ}_r(ρ) + λ(V^{π_θ}_c(ρ) − b), and denote the gradient mapping of Algorithm 1 by

$$G_t \triangleq \begin{pmatrix} \beta_t\big(\lambda_t - \mathrm{Proj}_{[0,\Lambda^*]}\big(\lambda_t - \tfrac{1}{\beta_t}\nabla_{\lambda} V_L(\theta_t,\lambda_t)\big)\big) \\[2pt] \alpha_t\big(\theta_t - \mathrm{Proj}_{\Theta}\big(\theta_t + \tfrac{1}{\alpha_t}\nabla_{\theta} V_L(\theta_t,\lambda_t)\big)\big) \end{pmatrix}.$$

The gradient mapping is a standard measure of convergence for projected optimization approaches (Beck, 2017). Intuitively, it reduces to the gradient (∇_λ V_L, ∇_θ V_L) when Λ* = ∞ and Θ = R^d, and it measures the updates of θ and λ at time step t. If ∥G_t∥ → 0, the updates of both variables are small, and the algorithm converges to a stationary solution. To show the convergence of Algorithm 1, we make the following Lipschitz smoothness assumption.

Assumption 3. The gradients of the Lagrangian function are Lipschitz:
∥∇_λ V_L(θ_1, λ) − ∇_λ V_L(θ_2, λ)∥ ≤ L_11 ∥θ_1 − θ_2∥,
∥∇_λ V_L(θ, λ_1) − ∇_λ V_L(θ, λ_2)∥ ≤ L_12 |λ_1 − λ_2|,
∥∇_θ V_L(θ_1, λ) − ∇_θ V_L(θ_2, λ)∥ ≤ L_21 ∥θ_1 − θ_2∥,
∥∇_θ V_L(θ, λ_1) − ∇_θ V_L(θ, λ_2)∥ ≤ L_22 |λ_1 − λ_2|.

As will be shown in Section 4, Assumption 3 is satisfied by a smoothed approximation of the robust value function. In the following theorem, we show that our robust primal-dual algorithm converges to a stationary point of the min-max problem eq. (8), with a complexity of O(ε⁻⁴).

Theorem 1. Under Assumption 3, if we set the step sizes α_t, β_t, and b_t as in Section J and T = O(ε⁻⁴), then min_{1≤t≤T} ∥G_t∥ ≤ 2ε.

The next proposition characterizes the feasibility of the obtained policy.

Proposition 1. Denote W ≜ arg min_{1≤t≤T} ∥G_t∥. If λ_W − (1/β_W)∇_λ V_L(θ_W, λ_W) ∈ [0, Λ*), then π_W satisfies the constraint with at most a 2ε-violation.

In general, convergence to stationary points of the Lagrangian function does not necessarily imply that the solution is feasible. Proposition 1 shows that Algorithm 1 returns a policy that is robust feasible, i.e., one satisfying the constraints in eq. (5). Intuitively, if we set Λ* large enough that the optimal solution λ* ∈ [0, Λ*), then Algorithm 1 is expected to converge to an interior point of [0, Λ*], and therefore π_W is feasible. On the other hand, Λ* cannot be set too large: the complexity in Theorem 1 depends on Λ* (see eq. (59) in the appendix), and a larger Λ* means higher complexity.
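The RPD iteration can be sketched on a one-dimensional toy Lagrangian where the value functions are known in closed form, so the updates of Algorithm 1 can be followed exactly. Here V_r(θ) = −(θ−1)², V_c(θ) = θ, and b = 0.5 are our illustrative choices, as are the constant step sizes and the schedule b_t = 1/(t+1) (the paper's schedules are in Section J). The reward-optimal θ = 1 is strictly feasible, so λ should vanish.

```python
def proj(x, lo, hi):
    """Projection onto the interval [lo, hi]."""
    return max(lo, min(hi, x))

# Toy problem: V_r(theta) = -(theta - 1)^2, V_c(theta) = theta, threshold b = 0.5.
def grad_theta(theta, lam):
    # d/dtheta of the Lagrangian V_r + lam * (V_c - b)
    return -2.0 * (theta - 1.0) + lam

Lam_star, b_thr = 5.0, 0.5
theta, lam = 0.0, 1.0
eta_theta, eta_lam = 0.05, 0.05       # stand-ins for 1/alpha_t and 1/beta_t
for t in range(2000):
    b_t = 1.0 / (t + 1.0)             # decreasing regularizer b_t
    v_c = theta                       # exact "critic" for this toy problem
    # Dual step first (uses theta_t), then primal step (uses lambda_{t+1}).
    lam_new = proj(lam - eta_lam * (v_c - b_thr) - eta_lam * b_t * lam, 0.0, Lam_star)
    theta = proj(theta + eta_theta * grad_theta(theta, lam_new), 0.0, 2.0)
    lam = lam_new
print(theta, lam)   # theta ~ 1.0, lam ~ 0.0
```

Early on, the constraint θ ≥ 0.5 is violated, so λ grows and pushes θ up; once θ is strictly feasible, λ is driven to the boundary of [0, Λ*] at 0 and the ascent settles at the reward optimum, mirroring the interior-point intuition after Proposition 1.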

4. δ-CONTAMINATION UNCERTAINTY SET

In this section, we investigate a concrete example of robust constrained RL with the δ-contamination uncertainty set. The method developed here can be similarly extended to other types of uncertainty sets, e.g., those defined by KL divergence or total variation. The δ-contamination uncertainty set models the scenario where the state transition of the MDP can be arbitrarily perturbed with a small probability δ. This model is widely used to model distributional uncertainty in the robust learning and optimization literature, e.g., (Huber, 1965; Du et al., 2018; Huber & Ronchetti, 2009; Nishimura & Ozaki, 2004; 2006; Prasad et al., 2020a;b; Wang & Zou, 2021; 2022). Specifically, let P = {p^a_s | s ∈ S, a ∈ A} be the centroid transition kernel; the δ-contamination uncertainty set centered at P is defined as P ≜ ⊗_{s∈S, a∈A} P^a_s, where

$$\mathcal{P}^a_s \triangleq \{(1-\delta)p^a_s + \delta q \mid q\in\Delta_S\},\quad s\in S,\ a\in A.$$

Under the δ-contamination setting, the robust Bellman operator can be computed explicitly:

$$\mathbf{T}^{\pi}V(s) = \sum_{a\in A}\pi(a|s)\Big(r(s,a) + \gamma\delta\min_{s'}V(s') + \gamma(1-\delta)\sum_{s'\in S}p^a_{s,s'}V(s')\Big).$$

In this case, the robust value function is non-differentiable due to the min term, so Assumption 3 does not hold. One possible approach is to use sub-gradients, which, however, is less stable, and whose convergence is difficult to characterize. In the following, we instead design a differentiable and smooth approximation of the robust value function. Specifically, consider a smoothed robust Bellman operator T^π_σ based on the log-sum-exp (LSE) function:

$$\mathbf{T}^{\pi}_{\sigma}V(s) = \mathbb{E}_{A\sim\pi(\cdot|s)}\Big[r(s,A) + \gamma(1-\delta)\sum_{s'\in S}p^A_{s,s'}V(s') + \gamma\delta\,\mathrm{LSE}(\sigma,V)\Big],$$

where LSE(σ, V) = log(Σ_{i=1}^d e^{σV(i)})/σ for V ∈ R^d and some σ < 0. The approximation error |LSE(σ, V) − min V| → 0 as σ → −∞, and hence the fixed point of T^π_σ, denoted by V^π_σ, is an approximation of the robust value function V^π (Wang & Zou, 2022).
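The LSE smoothing is easy to check numerically: for σ < 0, LSE(σ, V) lower-bounds min V and the gap is at most log(d)/|σ|. Below is a numerically stable implementation, together with one application of the smoothed operator T^π_σ; the tabular layout and names are our own sketch.

```python
import math

def lse(sigma, V):
    """LSE(sigma, V) = log(sum_i exp(sigma * V_i)) / sigma, computed stably for sigma < 0."""
    m = min(V)  # sigma * (v - m) <= 0, so every exponent stays in (0, 1]
    return m + math.log(sum(math.exp(sigma * (v - m)) for v in V)) / sigma

def smoothed_robust_bellman(p, rew, pi, V, gamma, delta, sigma):
    """One application of T_sigma^pi under delta-contamination (tabular sketch)."""
    nS, nA = len(rew), len(rew[0])
    l = lse(sigma, V)
    return [sum(pi[s][a] * (rew[s][a]
                            + gamma * (1 - delta) * sum(p[s][a][s2] * V[s2] for s2 in range(nS))
                            + gamma * delta * l)
                for a in range(nA)) for s in range(nS)]

V = [0.3, 1.2, 0.7, 2.0]
for sigma in (-1.0, -10.0, -100.0):
    print(sigma, lse(sigma, V))   # approaches min(V) = 0.3 as sigma -> -inf
```

Subtracting min(V) before exponentiating is what keeps the computation stable for very negative σ, where a naive implementation would underflow.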
We refer to V^π_σ as the smoothed robust value function and define the smoothed robust action-value function as

$$Q^{\pi}_{\sigma}(s,a) \triangleq r(s,a) + \gamma(1-\delta)\sum_{s'\in S}p^a_{s,s'}V^{\pi}_{\sigma}(s') + \gamma\delta\,\mathrm{LSE}(\sigma,V^{\pi}_{\sigma}).$$

It can be shown that for any π, as σ → −∞, ∥V^π_r − V^π_{σ,r}∥ → 0 and ∥V^π_c − V^π_{σ,c}∥ → 0. The gradient of V^{π_θ}_σ can be computed explicitly (Wang & Zou, 2022):

$$\nabla V^{\pi_\theta}_{\sigma}(s) = B(s,\theta) + \frac{\gamma\delta\sum_{s'\in S}e^{\sigma V^{\pi_\theta}_{\sigma}(s')}\,B(s',\theta)}{(1-\gamma)\sum_{s'\in S}e^{\sigma V^{\pi_\theta}_{\sigma}(s')}},\quad \text{where}\ B(s,\theta) \triangleq \frac{1}{1-\gamma+\gamma\delta}\sum_{s'\in S}d^{\pi_\theta}_{s,\mathsf{P}}(s')\sum_{a\in A}\nabla\pi_\theta(a|s')\,Q^{\pi_\theta}_{\sigma}(s',a).$$

A natural idea is then to use the smoothed robust value functions to replace those in eq. (7):

$$\min_{\lambda\ge 0}\ \max_{\pi\in\Pi_\Theta}\ V^{\pi}_{\sigma,r}(\rho) + \lambda\big(V^{\pi}_{\sigma,c}(\rho) - b\big). \tag{15}$$

As will be shown below in Lemma 6, this approximation can be made arbitrarily close to the original problem in eq. (7) as σ → −∞. We first show that under Assumption 2, the following Slater's condition holds for the smoothed problem in eq. (15).

Lemma 4. Let σ be sufficiently small such that ∥V^π_{σ,c} − V^π_c∥ < ζ for any π. Then there exist ζ′ > 0 and a policy π′ ∈ Π_Θ s.t. V^{π′}_{σ,c}(ρ) − b ≥ ζ′.

The following lemma shows that the optimal dual variable for eq. (15) is also bounded.

Lemma 5. Denote the optimal solution of eq. (15) by (λ*, π_{θ*}). Then λ* ∈ [0, 2C_σ/ζ′], where C_σ is the upper bound of the smoothed robust value function V^π_{σ,c}.

Denote Λ* = max{2C_σ/ζ′, 2/(ζ(1−γ))}. Then problems eq. (8) and eq. (15) are equivalent to the following bounded ones:

$$\min_{\lambda\in[0,\Lambda^*]}\max_{\pi\in\Pi_\Theta}\ V^{\pi}_{r}(\rho)+\lambda\big(V^{\pi}_{c}(\rho)-b\big),\quad\text{and}\quad \min_{\lambda\in[0,\Lambda^*]}\max_{\pi\in\Pi_\Theta}\ V^{\pi}_{\sigma,r}(\rho)+\lambda\big(V^{\pi}_{\sigma,c}(\rho)-b\big). \tag{16}$$

The following lemma shows that the two problems are within a gap of O(ε).

Lemma 6. Choose σ small enough that ∥V^π_r − V^π_{σ,r}∥ ≤ ε and ∥V^π_c − V^π_{σ,c}∥ ≤ ε. Then

$$\Big|\min_{\lambda\in[0,\Lambda^*]}\max_{\pi\in\Pi_\Theta}\big(V^{\pi}_{\sigma,r}(\rho)+\lambda(V^{\pi}_{\sigma,c}(\rho)-b)\big) - \min_{\lambda\in[0,\Lambda^*]}\max_{\pi\in\Pi_\Theta}\big(V^{\pi}_{r}(\rho)+\lambda(V^{\pi}_{c}(\rho)-b)\big)\Big| \le (1+\Lambda^*)\,\epsilon.$$

In the following, we hence focus on the smoothed dual problem in eq. (16), which is an accurate approximation of the original problem eq. (8). Denote the gradient mapping of the smoothed Lagrangian function V^L_σ by

$$G^{\sigma}_t \triangleq \begin{pmatrix} \beta_t\big(\lambda_t - \mathrm{Proj}_{[0,\Lambda^*]}\big(\lambda_t - \tfrac{1}{\beta_t}\nabla_{\lambda} V^L_{\sigma}(\theta_t,\lambda_t)\big)\big) \\[2pt] \alpha_t\big(\theta_t - \mathrm{Proj}_{\Theta}\big(\theta_t + \tfrac{1}{\alpha_t}\nabla_{\theta} V^L_{\sigma}(\theta_t,\lambda_t)\big)\big) \end{pmatrix}.$$

Applying our RPD algorithm to eq. (16), we have the following convergence guarantee.

Corollary 1. If we set the step sizes α_t, β_t, and b_t as in Section J and T = O(ε⁻⁴), then min_{1≤t≤T} ∥G^σ_t∥ ≤ 2ε.

This corollary implies that our robust primal-dual algorithm converges to a stationary point of the min-max problem eq. (16) under the δ-contamination model, with a complexity of O(ε⁻⁴).

Algorithm 2 Smoothed Robust TD (Wang & Zou, 2022)
Input: T_inner, π, σ, c
Initialization: Q_0, s_0
for t = 0, 1, ..., T_inner − 1 do
    Choose a_t ∼ π(·|s_t) and observe c_t, s_{t+1}
    V_t(s) ← Σ_{a∈A} π(a|s) Q_t(s, a) for all s ∈ S
    Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + α_t( c_t + γ(1−δ)·V_t(s_{t+1}) + γδ·LSE(σ, V_t) − Q_t(s_t, a_t) )
end for
Output: Q_{T_inner,c} ≜ Q_{T_inner}

Note that Algorithm 1 assumes knowledge of the smoothed robust value functions, which may not be available in practice. Unlike non-robust value functions, which can be estimated using Monte Carlo, robust value functions correspond to the worst-case transition kernel, from which no samples are directly taken.
To solve this issue, we adopt the smoothed robust TD algorithm (Algorithm 2) from (Wang & Zou, 2022) to estimate the smoothed robust value functions. In the tabular case, the smoothed robust TD algorithm converges to the smoothed robust value function with a sample complexity of O(ε⁻²) (Wang & Zou, 2022). We then construct our online and model-free RPD algorithm in Algorithm 3. We note that Algorithm 3 is stated for the tabular setting with finite S and A; it can easily be extended to large or continuous S and A using function approximation.
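A tabular sketch of the smoothed robust TD update of Algorithm 2 follows (our own minimal implementation, with an illustrative decaying step size in place of the paper's α_t). Samples are drawn from the nominal kernel, while the TD target mixes in the γδ·LSE(σ, V_t) term, so the iterate tracks the smoothed robust value function rather than the nominal one; we check it against the exact fixed point of T^π_σ.

```python
import math, random

def lse(sigma, V):
    """Stable LSE(sigma, V) for sigma < 0."""
    m = min(V)
    return m + math.log(sum(math.exp(sigma * (v - m)) for v in V)) / sigma

def smoothed_robust_td(p, cost, pi, gamma, delta, sigma, T_inner, seed=0):
    """Algorithm 2 sketch: tabular smoothed robust TD driven by samples from the nominal kernel p."""
    rng = random.Random(seed)
    nS, nA = len(cost), len(cost[0])
    Q = [[0.0] * nA for _ in range(nS)]
    s = 0
    for t in range(T_inner):
        a = rng.choices(range(nA), weights=pi[s])[0]
        s2 = rng.choices(range(nS), weights=p[s][a])[0]
        V = [sum(pi[x][b] * Q[x][b] for b in range(nA)) for x in range(nS)]
        alpha = 50.0 / (500.0 + t)                  # illustrative decaying step size
        target = cost[s][a] + gamma * ((1 - delta) * V[s2] + delta * lse(sigma, V))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
    return Q

# Compare against the exact fixed point of the smoothed robust Bellman operator.
gamma, delta, sigma = 0.8, 0.2, -10.0
cost = [[1.0], [0.2]]
p = [[[0.7, 0.3]], [[0.4, 0.6]]]
pi = [[1.0], [1.0]]
V_exact = [0.0, 0.0]
for _ in range(500):
    l = lse(sigma, V_exact)
    V_exact = [cost[s][0] + gamma * ((1 - delta) * sum(p[s][0][s2] * V_exact[s2] for s2 in range(2))
                                     + delta * l) for s in range(2)]
Q = smoothed_robust_td(p, cost, pi, gamma, delta, sigma, T_inner=40000)
print("exact:", V_exact, "TD:", [Q[0][0], Q[1][0]])
```

In Algorithm 3 this estimator plays the role of the critic: its outputs replace the exact smoothed robust value functions in the λ and θ updates, which is what makes the overall scheme a biased stochastic gradient descent-ascent method.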

Algorithm 3 Online Robust Primal-Dual algorithm

Input: T, σ, ε_est, β_t, α_t, b_t, r, c
Initialization: λ_0, θ_0
for t = 0, 1, ..., T − 1 do
    Set T_inner = O((t+1)^{1.5}/ε²_est) and run Algorithm 2 for r and c; output Q_{T_inner,r}, Q_{T_inner,c}
    V^{π_{θ_t}}_{σ,r}(s) ← Σ_a π_{θ_t}(a|s) Q_{T_inner,r}(s, a),  V^{π_{θ_t}}_{σ,c}(s) ← Σ_a π_{θ_t}(a|s) Q_{T_inner,c}(s, a)
    V^{π_{θ_t}}_{σ,r}(ρ) ← Σ_s ρ(s) V^{π_{θ_t}}_{σ,r}(s),  V^{π_{θ_t}}_{σ,c}(ρ) ← Σ_s ρ(s) V^{π_{θ_t}}_{σ,c}(s)
    λ_{t+1} ← Proj_{[0,Λ*]}( λ_t − (1/β_t)( V^{π_{θ_t}}_{σ,c}(ρ) − b ) − (b_t/β_t) λ_t )
    θ_{t+1} ← Proj_Θ( θ_t + (1/α_t)( ∇_θ V^{π_{θ_t}}_{σ,r}(ρ) + λ_{t+1} ∇_θ V^{π_{θ_t}}_{σ,c}(ρ) ) )
end for
Output: θ_T

Algorithm 3 can be viewed as a biased stochastic gradient descent-ascent algorithm. It is a sample-based algorithm that does not assume any knowledge of the robust value functions and can be performed in an online fashion. We further extend the convergence results in Theorem 1 to the model-free setting and characterize a finite-time error bound for Algorithm 3 (Theorem 2). Similarly, Algorithm 3 can be shown to achieve a 2ε-feasible policy almost surely.

6. CONCLUSION

In this paper, we formulate the problem of robust constrained reinforcement learning under model uncertainty, where the goal is to guarantee that the constraints are satisfied for all MDPs in the uncertainty set and to maximize the worst-case reward performance over the uncertainty set. We propose a robust primal-dual algorithm and theoretically characterize its convergence, complexity, and robust feasibility. Our algorithm guarantees convergence to a feasible solution and outperforms the two heuristic baselines. We further investigate a concrete example with the δ-contamination uncertainty set and construct an online and model-free robust primal-dual algorithm. Our methodology can also be readily extended to problems with other uncertainty sets, such as those defined by KL divergence, total variation, or the Wasserstein distance; the major challenges lie in deriving the robust policy gradient and designing a model-free algorithm to estimate the robust value function.



Here d^{π_θ}_{s,P}(·) is the visitation distribution of π_θ under P starting from s. Denote the smoothed Lagrangian function by V^L_σ(θ, λ) ≜ V^{π_θ}_{σ,r}(ρ) + λ(V^{π_θ}_{σ,c}(ρ) − b). The following lemma shows that ∇V^L_σ is Lipschitz.

Lemma 3. ∇V^L_σ is Lipschitz in θ and λ; hence Assumption 3 holds for V^L_σ.

Figure 1: Comparison on the Garnet problem G(20, 10). Panels: (a) V_c when δ = 0.2; (b) V_r when δ = 0.2; (c) V_c when δ = 0.3; (d) V_r when δ = 0.3.

Figure 2: Comparison on the 8 × 8 Frozen-Lake problem. Panels: (a) V_c when δ = 0.2; (b) V_r when δ = 0.2; (c) V_c when δ = 0.3; (d) V_r when δ = 0.3.

Figure 3: Comparison on Taxi Problem.

APPENDIX

Under review as a conference paper at ICLR 2023

Under the online model-free setting, the estimation of the robust value functions is biased. Therefore, the analysis is more challenging than in the existing literature, where it is usually assumed that the gradients are exact. We develop a new method to bound the bias accumulated over the iterations of the algorithm and establish the final convergence results.

Theorem 2. Consider the same conditions as in Theorem 1. Let

5. NUMERICAL RESULTS

In this section, we numerically demonstrate the robustness of our algorithm in terms of both maximizing the robust reward value function and satisfying the constraints under model uncertainty. We compare our RPD algorithm with the heuristic algorithms in (Russel et al., 2021; Mankowitz et al., 2020) and with the vanilla non-robust primal-dual method. Based on the idea of "robust policy evaluation" + "non-robust policy improvement" in (Russel et al., 2021; Mankowitz et al., 2020), we combine the robust TD algorithm (Algorithm 2) with the non-robust vanilla policy gradient method (Sutton et al., 1999), which we refer to as the heuristic primal-dual algorithm. Several environments are investigated, including Garnet (Archibald et al., 1995) and the 8 × 8 Frozen-Lake and Taxi environments from OpenAI (Brockman et al., 2016).

We first run the algorithms and store the obtained policy π_t at each time step. Then we run the non-smoothed robust TD algorithm (Alg. 3 in (Wang & Zou, 2022)) with a sample size of 200, repeated 30 times, to estimate the non-smoothed objective V_r(ρ) and the non-smoothed constraint V_c(ρ), and plot them vs. the number of iterations t. The upper and lower envelopes of the curves correspond to the 95th and 5th percentiles of the 30 runs, respectively. We repeat the experiment for two values of δ: 0.2 and 0.3.
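A Garnet-style instance like those used in these experiments can be generated in a few lines. The sketch below uses fully random normalized transition rows and uniform rewards/utilities; it omits the branching-factor parameter of the original Garnet construction (Archibald et al., 1995), and all function names are our own.

```python
import random

def make_garnet(n_states, n_actions, seed=0):
    """Sample a Garnet-style instance G(S_n, A_n): random transition kernel and
    rewards/utilities drawn uniformly from [0, 1] (simplified: no branching factor)."""
    rng = random.Random(seed)
    P, r, c = [], [], []
    for s in range(n_states):
        P_s, r_s, c_s = [], [], []
        for a in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            z = sum(w)
            P_s.append([x / z for x in w])   # normalized random transition row
            r_s.append(rng.random())         # reward r(s, a) ~ U[0, 1]
            c_s.append(rng.random())         # utility c(s, a) ~ U[0, 1]
        P.append(P_s); r.append(r_s); c.append(c_s)
    return P, r, c

P, r, c = make_garnet(20, 10)   # a G(20, 10) instance as in Figure 1
```

Different seeds give independent instances, which is convenient for repeating the experiment and plotting percentile envelopes as described above.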

Garnet problem.

A Garnet problem is specified by G(S_n, A_n), where the state space S has S_n states (s_1, ..., s_{S_n}) and the action space has A_n actions (a_1, ..., a_{A_n}). The agent can take any action in any state and receives a reward/utility signal drawn from the uniform distribution on [0, 1]. The transition kernels are also randomly generated. The comparison results are shown in Fig. 1.

8 × 8 Frozen-Lake problem. We then compare the three algorithms in the 8 × 8 Frozen-Lake setting in Fig. 2. The Frozen-Lake problem involves a frozen lake of size 8 × 8 that contains several "holes". The agent aims to cross the lake from the start point to the end point without falling into any hole. The agent receives r = −10 and c = 0 when falling into a hole and r = 20 and c = 1 when arriving at the end point; at all other times, the agent receives r = 0 and a utility c drawn uniformly from [0, 1].

Taxi problem. We then compare the three algorithms in the Taxi environment, which simulates a taxi driver in a 5 × 5 map. There are four designated locations in the grid world, and a passenger appears at one of them, chosen at random, at the start of each episode. The goal of the driver is to pick up the passenger and drop them off at another specified location. The driver receives r = 20 for each successful drop-off and r = −1 at all other times. We randomly generate the utility uniformly on [0, 1] for each state-action pair. The results are shown in Fig. 3.

From the experimental results above, it can be seen that: (1) Both our RPD algorithm and the heuristic primal-dual approach find feasible policies that satisfy the constraint robustly, i.e., the non-smoothed robust utility functions lie above the threshold, V^π_c ≥ b.
However, the non-robust primal-dual method fails to find a feasible solution that satisfies the constraint under the worst-case scenario. (2) Compared to the heuristic PD method, our RPD method obtains more reward and finds a more robust policy while satisfying the robust constraint. The non-robust PD method obtains more reward, but only because the policy it finds violates the robust constraint. Our experiments demonstrate that, among the three algorithms, our RPD algorithm is the best: it optimizes the worst-case reward performance while satisfying the robust constraints on the utility.

