ROBUST CONSTRAINED REINFORCEMENT LEARNING

Abstract

Constrained reinforcement learning maximizes the reward subject to constraints on utilities/costs. In practice, however, the training environment often differs from the test one, due to, e.g., modeling error, adversarial attack, or non-stationarity, resulting in severe performance degradation and, more importantly, constraint violation in the test environment. To address this challenge, we formulate the framework of robust constrained reinforcement learning under model uncertainty, where the MDP is not fixed but lies in some uncertainty set. The goal is twofold: 1) to guarantee that the constraints on utilities/costs are satisfied for all MDPs in the uncertainty set, and 2) to maximize the worst-case reward performance over the uncertainty set. We design a robust primal-dual approach and further develop theoretical guarantees on its convergence, complexity, and robust feasibility. We then investigate the concrete example of the δ-contamination uncertainty set, design an online and model-free algorithm, and theoretically characterize its sample complexity.

1. INTRODUCTION

In many practical reinforcement learning (RL) applications, it is critical for an agent to meet certain constraints on utilities/costs while maximizing the reward. This problem is usually modeled as a constrained Markov decision process (CMDP) (Altman, 1999). Consider a CMDP with state space S, action space A, transition kernel P = {p_s^a ∈ ∆_S^1 : s ∈ S, a ∈ A}, reward and utility functions r, c_i : S × A → [0, 1], 1 ≤ i ≤ m, and discount factor γ. The goal of the CMDP is to find a stationary policy π : S → ∆_A that maximizes the expected reward subject to constraints on the utilities:

max_{π∈Π} E_{π,P}[∑_{t=0}^∞ γ^t r(S_t, A_t) | S_0 ∼ ρ], s.t. E_{π,P}[∑_{t=0}^∞ γ^t c_i(S_t, A_t) | S_0 ∼ ρ] ≥ b_i, 1 ≤ i ≤ m,   (1)

where ρ is the initial state distribution, the b_i's are thresholds, and E_{π,P} denotes the expectation when the agent follows policy π and the environment transits following P. In practice, the environment on which the learned policy is deployed (the test environment) may deviate from the training one, due to, e.g., modeling error of the simulator, adversarial attack, or non-stationarity. This can lead to a significant performance degradation in reward and, more importantly, the constraints may no longer be satisfied, which is severe in safety-critical applications. For example, a drone may run out of battery and crash due to a mismatch between training and test environments. This motivates the study of robust constrained RL in this paper. We take a pessimistic approach in the face of uncertainty. Specifically, consider a set of transition kernels P, which is usually constructed so as to include the test environment with high probability (Iyengar, 2005; Nilim & El Ghaoui, 2004; Bagnell et al., 2001).
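To fix notation, the following minimal sketch evaluates both objectives in eq. (1) exactly for a small tabular CMDP by solving the linear system V = f_π + γ P_π V; the problem sizes, random data, and uniform policy are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def discounted_return(P, f, pi, rho, gamma):
    """Expected discounted sum of f(s, a) under policy pi and kernel P.

    P:  (S, A, S) transition kernel, f: (S, A) per-step signal,
    pi: (S, A) stochastic policy, rho: (S,) initial distribution.
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # policy-induced state kernel
    f_pi = np.einsum("sa,sa->s", pi, f)     # policy-averaged per-state signal
    # Solve (I - gamma * P_pi) V = f_pi for the value function.
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, f_pi)
    return rho @ V

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
r = rng.uniform(size=(S, A))                 # reward in [0, 1]
c = rng.uniform(size=(S, A))                 # utility in [0, 1]
pi = np.full((S, A), 1.0 / A)                # uniform policy
rho = np.full(S, 1.0 / S)

J_r = discounted_return(P, r, pi, rho, gamma)
J_c = discounted_return(P, c, pi, rho, gamma)
# The CMDP in eq. (1) asks for J_r maximal subject to J_c >= b.
```

Both returns are bounded by 1/(1 − γ) since the per-step signals lie in [0, 1].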
The learned policy should satisfy the constraints under all environments in P, i.e., for every P ∈ P and every 1 ≤ i ≤ m,

E_{π,P}[∑_{t=0}^∞ γ^t c_i(S_t, A_t) | S_0 ∼ ρ] ≥ b_i,   (2)

which is equivalent to min_{P∈P} E_{π,P}[∑_{t=0}^∞ γ^t c_i(S_t, A_t) | S_0 ∼ ρ] ≥ b_i. At the same time, we aim to optimize the worst-case reward performance over P:

max_{π∈Π} min_{P∈P} E_{π,P}[∑_{t=0}^∞ γ^t r(S_t, A_t) | S_0 ∼ ρ], s.t. min_{P∈P} E_{π,P}[∑_{t=0}^∞ γ^t c_i(S_t, A_t) | S_0 ∼ ρ] ≥ b_i, 1 ≤ i ≤ m.   (3)

¹ ∆_X denotes the probability simplex supported on the set X.

On one hand, a feasible solution to eq. (3) always satisfies eq. (2); on the other hand, the solution to eq. (3) provides a performance guarantee for any P ∈ P. We note that our approach and analysis can also be applied to the optimistic approach in the face of uncertainty.

In this paper, we design and analyze a robust primal-dual algorithm for robust constrained RL. The main technical challenges and our major contributions are as follows.

• We take the Lagrange multiplier method to solve the constrained policy optimization problem. A first question is whether the primal problem is equivalent to the dual problem, i.e., whether the duality gap is zero. For non-robust constrained RL, the Lagrangian has a zero duality gap (Paternain et al., 2019; Altman, 1999). However, we show that this is not necessarily true in the robust constrained setting. Convexity of the set of visitation distributions is a key property used to show the zero duality gap of CMDPs (Altman, 1999; Paternain et al., 2019); in this paper, we construct a novel counterexample showing that the set of robust visitation distributions of our robust problem is non-convex.

• In the dual problem of non-robust CMDPs, the sum of two value functions is itself a value function of the combined reward. This does not hold in the robust setting, since the worst-case transition kernels of the two robust value functions are not necessarily the same. Therefore, the geometry of our Lagrangian is much more complicated. We formulate the dual problem of robust constrained RL as a minimax linear-nonconcave optimization problem and show that the optimal dual variable is bounded. We then construct a robust primal-dual algorithm that alternately updates the primal and dual variables, theoretically prove its convergence to stationary points, and characterize its complexity.
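To make the worst-case objective in eq. (3) concrete, the sketch below evaluates a fixed policy's worst-case reward and utility over a small finite uncertainty set of sampled kernels; all sizes, signals, and the uniform policy are illustrative assumptions, not from the paper. It also illustrates the point above: the minimizing kernels for the reward and the utility need not coincide.

```python
import numpy as np

def discounted_return(P, f, pi, rho, gamma):
    """rho-weighted discounted value of signal f under policy pi and kernel P."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # policy-induced state kernel
    f_pi = np.einsum("sa,sa->s", pi, f)     # policy-averaged signal
    return rho @ np.linalg.solve(np.eye(S) - gamma * P_pi, f_pi)

rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
kernels = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(5)]  # finite P
r = rng.uniform(size=(S, A))            # reward
c = rng.uniform(size=(S, A))            # utility
pi = np.full((S, A), 1.0 / A)           # fixed (uniform) policy
rho = np.full(S, 1.0 / S)

# Worst-case performance over the uncertainty set, as in eqs. (2)-(3).
worst_reward = min(discounted_return(P, r, pi, rho, gamma) for P in kernels)
worst_utility = min(discounted_return(P, c, pi, rho, gamma) for P in kernels)

# The adversarial kernel generally differs between reward and utility, which
# is why the robust Lagrangian is not itself a single robust value function.
argmin_r = min(range(5), key=lambda i: discounted_return(kernels[i], r, pi, rho, gamma))
argmin_c = min(range(5), key=lambda i: discounted_return(kernels[i], c, pi, rho, gamma))
```

Here pi is robustly feasible for a threshold b iff `worst_utility >= b`, and eq. (3) ranks feasible policies by `worst_reward`.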
• In general, convergence to stationary points of the Lagrangian does not necessarily imply that the solution is feasible (Lin et al., 2020; Xu et al., 2020). We design a novel proof showing that the gradient belongs to the normal cone of the feasible set, based on which we further prove the robust feasibility of the obtained policy.

• We apply and extend our results to an important uncertainty set referred to as the δ-contamination model (Huber, 1965). Under this model, the robust value functions are not differentiable, and we hence propose a smoothed approximation of the robust value function with a better geometry. We further investigate the practical online and model-free setting and design an actor-critic type algorithm. We also establish its convergence, sample complexity, and robust feasibility.

We next discuss works related to robust constrained RL.

Robust constrained RL. In (Russel et al., 2020), the robust constrained RL problem was studied, and a heuristic approach was developed. The basic idea is to estimate the robust value functions, and then to use the vanilla policy gradient method (Sutton et al., 1999) with the vanilla value function replaced by the robust one. However, this approach does not account for the fact that the worst-case transition kernel is itself a function of the policy (see Section 3.1 in (Russel et al., 2020)), and therefore the "gradient" therein is not actually the gradient of the robust value function. Thus, its performance and convergence cannot be theoretically guaranteed. Another work (Mankowitz et al., 2020) studied the same robust constrained RL problem in the continuous control setting and proposed a similar heuristic algorithm: a robust Bellman operator is first used to estimate the robust value function, which is then combined with a non-robust continuous control algorithm to update the policy.
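As a concrete instance of the δ-contamination model mentioned in the contributions above, the following sketch runs fixed-policy robust value iteration, where with probability δ the adversary replaces the transition with the worst possible next state, together with a log-sum-exp smoothing of the inner min; the smoothing constant σ and all problem data are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def robust_policy_eval(P, r, pi, gamma, delta, iters=500):
    """Fixed-policy robust value iteration under delta-contamination:
    the uncertainty set is {(1 - delta) p + delta q : q arbitrary},
    so the adversarial term is delta * min_s V(s)."""
    S = P.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        # Robust Bellman operator for the contamination set.
        Q = r + gamma * ((1 - delta) * (P @ V) + delta * V.min())
        V = np.einsum("sa,sa->s", pi, Q)
    return V

def smooth_min(V, sigma=10.0):
    """Log-sum-exp smoothing of the (non-differentiable) min:
    lies in [min(V), min(V) + log(len(V))/sigma] and -> min(V) as sigma grows."""
    m = V.min()
    return m - np.log(np.mean(np.exp(-sigma * (V - m)))) / sigma

rng = np.random.default_rng(3)
S, A, gamma, delta = 4, 2, 0.9, 0.2
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
pi = np.full((S, A), 1.0 / A)
V = robust_policy_eval(P, r, pi, gamma, delta)
```

The operator is a γ-contraction, so 500 iterations bring the fixed-point error down to a negligible level for this example.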
Both approaches in (Russel et al., 2020) and (Mankowitz et al., 2020) inherit the heuristic structure of "robust policy evaluation" + "non-robust vanilla policy improvement", which does not necessarily guarantee an improved policy in general. In this paper, we employ a "robust policy evaluation" + "robust policy improvement" approach, which guarantees policy improvement, and, more importantly, we provide theoretical convergence guarantees, robust feasibility guarantees, and complexity analyses for our algorithms.

Constrained RL. The most commonly used method for constrained RL is the primal-dual method (Altman, 1999; Paternain et al., 2019; 2022; Liang et al., 2018; Stooke et al., 2020; Tessler et al., 2018; Yu et al., 2019; Zheng & Ratliff, 2020; Efroni et al., 2020; Auer et al., 2008), which augments the objective with a sum of constraints weighted by their corresponding Lagrange multipliers, and then alternately updates the primal and dual variables. It was shown that the strong duality holds
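The generic primal-dual scheme for the non-robust problem eq. (1) can be sketched for a tiny tabular CMDP as follows; the softmax parameterization, finite-difference gradients, step sizes, and threshold b are all illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, b = 3, 2, 0.9, 2.0
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel
r = rng.uniform(size=(S, A))                 # reward
c = rng.uniform(size=(S, A))                 # utility, constraint J(theta, c) >= b
rho = np.full(S, 1.0 / S)

def J(theta, f):
    """Discounted return of signal f under the softmax policy of theta."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))  # stable softmax
    pi = z / z.sum(axis=1, keepdims=True)
    P_pi = np.einsum("sa,sat->st", pi, P)
    f_pi = np.einsum("sa,sa->s", pi, f)
    return rho @ np.linalg.solve(np.eye(S) - gamma * P_pi, f_pi)

def grad(theta, lam, eps=1e-5):
    """Crude central-difference gradient of the Lagrangian in theta."""
    L = lambda t: J(t, r) + lam * (J(t, c) - b)  # augmented objective
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        e = np.zeros_like(theta)
        e[idx] = eps
        g[idx] = (L(theta + e) - L(theta - e)) / (2 * eps)
    return g

theta, lam = np.zeros((S, A)), 1.0
for _ in range(300):
    theta = theta + 0.5 * grad(theta, lam)             # primal ascent
    lam = max(0.0, lam - 0.1 * (J(theta, c) - b))      # projected dual descent
```

The multiplier grows while the constraint is violated and shrinks toward zero once J(theta, c) exceeds b, which is the mechanism the primal-dual literature above analyzes.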

