ROBUST CONSTRAINED REINFORCEMENT LEARNING

Abstract

Constrained reinforcement learning aims to maximize the reward subject to constraints on utilities/costs. In practice, however, the training environment often differs from the test environment due to, e.g., modeling error, adversarial attacks, or non-stationarity, which can cause severe performance degradation and, more importantly, constraint violation in the test environment. To address this challenge, we formulate the framework of robust constrained reinforcement learning under model uncertainty, where the MDP is not fixed but lies in some uncertainty set. The goal is twofold: 1) to guarantee that the constraints on utilities/costs are satisfied for all MDPs in the uncertainty set, and 2) to maximize the worst-case reward performance over the uncertainty set. We design a robust primal-dual approach and further develop theoretical guarantees on its convergence, complexity, and robust feasibility. We then investigate a concrete example, the δ-contamination uncertainty set, design an online and model-free algorithm, and theoretically characterize its sample complexity.

1. INTRODUCTION

In many practical reinforcement learning (RL) applications, it is critical for an agent to meet certain constraints on utilities/costs while maximizing the reward. This problem is usually modeled as a constrained Markov decision process (CMDP) (Altman, 1999). Consider a CMDP with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $\mathsf{P} = \{p_s^a \in \Delta_{\mathcal{S}}{}^1 : s \in \mathcal{S}, a \in \mathcal{A}\}$, reward and utility functions $r, c_i : \mathcal{S} \times \mathcal{A} \to [0,1]$, $1 \le i \le m$, and discount factor $\gamma$. The goal of the CMDP is to find a stationary policy $\pi : \mathcal{S} \to \Delta_{\mathcal{A}}$ that maximizes the expected reward subject to constraints on the utilities:

$$\max_{\pi \in \Pi} \mathbb{E}_{\pi,\mathsf{P}}\Big[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \,\Big|\, S_0 \sim \rho\Big], \quad \text{s.t.} \quad \mathbb{E}_{\pi,\mathsf{P}}\Big[\sum_{t=0}^{\infty} \gamma^t c_i(S_t, A_t) \,\Big|\, S_0 \sim \rho\Big] \ge b_i, \ 1 \le i \le m, \tag{1}$$

where $\rho$ is the initial state distribution, the $b_i$'s are given thresholds, and $\mathbb{E}_{\pi,\mathsf{P}}$ denotes the expectation when the agent follows policy $\pi$ and the environment transits according to $\mathsf{P}$. In practice, the environment in which the learned policy is deployed (the test environment) may deviate from the training one due to, e.g., modeling error of the simulator, adversarial attacks, or non-stationarity. This can lead to a significant performance degradation in reward and, more importantly, to violated constraints, which is severe in safety-critical applications. For example, a drone may run out of battery and crash due to the mismatch between training and test environments. This motivates the study of robust constrained RL in this paper. We take a pessimistic approach in the face of uncertainty. Specifically, consider a set of transition kernels $\mathcal{P}$, which is usually constructed so as to include the test environment with high probability (Iyengar, 2005; Nilim & El Ghaoui, 2004; Bagnell et al., 2001).
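The two expectations in (1) can be computed in closed form for a tabular CMDP, since a fixed policy induces a Markov chain whose discounted value solves a linear system. The following is a minimal sketch of this evaluation; the toy kernel, reward, utility, policy, and threshold `b = 1.0` are all hypothetical numbers chosen for illustration.

```python
import numpy as np

def policy_evaluation(P, f, pi, gamma, rho):
    """Exact evaluation of E_{pi,P}[ sum_t gamma^t f(S_t, A_t) | S_0 ~ rho ].

    P:   transition kernel, shape (S, A, S), P[s, a] a distribution over next states
    f:   per-step signal (reward r or utility c_i), shape (S, A)
    pi:  stationary stochastic policy, shape (S, A), rows sum to 1
    rho: initial state distribution, shape (S,)
    """
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state kernel under pi
    f_pi = np.einsum('sa,sa->s', pi, f)     # expected per-step signal under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, f_pi)  # (I - gamma P_pi)^{-1} f_pi
    return rho @ V

# Toy 2-state, 2-action CMDP with one utility constraint (hypothetical numbers).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))  # shape (S, A, S)
r = rng.uniform(size=(2, 2))                # reward in [0, 1]
c = rng.uniform(size=(2, 2))                # utility in [0, 1]
pi = np.full((2, 2), 0.5)                   # uniform policy
rho = np.array([1.0, 0.0])
gamma = 0.9

J_r = policy_evaluation(P, r, pi, gamma, rho)   # objective in (1)
J_c = policy_evaluation(P, c, pi, gamma, rho)   # left-hand side of the constraint
feasible = J_c >= 1.0                           # constraint J_c >= b with b = 1.0
```

Since $r, c_i \in [0,1]$, both values are bounded by $1/(1-\gamma)$, which gives a quick sanity check on the computation.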
The learned policy should satisfy the constraints under all environments in $\mathcal{P}$, i.e., for every $P \in \mathcal{P}$, $\mathbb{E}_{\pi,P}\big[\sum_{t=0}^{\infty} \gamma^t c_i(S_t, A_t) \mid S_0 \sim \rho\big] \ge b_i$, which is equivalent to $\min_{P \in \mathcal{P}} \mathbb{E}_{\pi,P}\big[\sum_{t=0}^{\infty} \gamma^t c_i(S_t, A_t) \mid S_0 \sim \rho\big] \ge b_i$. At the same time, we aim to optimize the worst-case reward performance over $\mathcal{P}$:



$$\max_{\pi \in \Pi} \min_{P \in \mathcal{P}} \mathbb{E}_{\pi,P}\Big[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \,\Big|\, S_0 \sim \rho\Big].$$

$^1$ $\Delta_X$ denotes the probability simplex supported on the set $X$.
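When the uncertainty set is a finite collection of kernels, the worst-case reward and the robust constraint check reduce to taking a minimum of exact policy evaluations over its members. The sketch below assumes such a finite set built from a nominal kernel by mixing in arbitrary kernels with weight 0.1, in the spirit of the δ-contamination set mentioned in the abstract (with δ = 0.1); all numbers and the threshold `b` are hypothetical.

```python
import numpy as np

def discounted_return(P, f, pi, gamma, rho):
    """E_{pi,P}[ sum_t gamma^t f(S_t, A_t) | S_0 ~ rho ] via a linear solve."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)
    f_pi = np.einsum('sa,sa->s', pi, f)
    return rho @ np.linalg.solve(np.eye(S) - gamma * P_pi, f_pi)

# Hypothetical finite uncertainty set: nominal kernel plus contaminated copies.
rng = np.random.default_rng(1)
nominal = rng.dirichlet(np.ones(3), size=(3, 2))      # shape (S, A, S)
uncertainty_set = [nominal]
for _ in range(4):
    noise = rng.dirichlet(np.ones(3), size=(3, 2))    # arbitrary stochastic kernel
    uncertainty_set.append(0.9 * nominal + 0.1 * noise)  # delta-contamination mixture

r = rng.uniform(size=(3, 2))
c = rng.uniform(size=(3, 2))
pi = np.full((3, 2), 0.5)
rho = np.ones(3) / 3
gamma, b = 0.9, 2.0

# Worst-case reward (the objective above) and robust constraint feasibility.
worst_reward = min(discounted_return(P, r, pi, gamma, rho) for P in uncertainty_set)
worst_utility = min(discounted_return(P, c, pi, gamma, rho) for P in uncertainty_set)
robust_feasible = worst_utility >= b
```

Note that a convex combination of stochastic kernels is again stochastic, so every member of the set is a valid transition kernel; the worst-case reward is by construction no larger than the nominal one.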


