A CMDP-WITHIN-ONLINE FRAMEWORK FOR META-SAFE REINFORCEMENT LEARNING

Abstract

Meta-reinforcement learning has been widely used as a learning-to-learn framework for solving unseen tasks with limited experience. However, the aspect of constraint violations has not been adequately addressed in existing works, which restricts their application in real-world settings. In this paper, we study the problem of meta-safe reinforcement learning (Meta-SRL) through the CMDP-within-online framework to establish the first provable guarantees in this important setting. We obtain task-averaged regret bounds for the reward maximization (optimality gap) and constraint violations using gradient-based meta-learning, and show that the task-averaged optimality gap and constraint satisfaction improve with task-similarity in a static environment or task-relatedness in a dynamic environment. Several technical challenges arise when making this framework practical. To this end, we propose a meta-algorithm that performs inexact online learning on upper bounds of the within-task optimality gap and constraint violations, estimated by off-policy stationary distribution corrections. Furthermore, we enable the learning rates to be adapted for every task and extend our approach to settings with a competing dynamically changing oracle. Finally, experiments are conducted to demonstrate the effectiveness of our approach.

1. INTRODUCTION

The field of meta-reinforcement learning (meta-RL) has recently evolved as one of the promising directions for enabling reinforcement learning (RL) agents to learn quickly in dynamically changing environments (Finn et al., 2017; Mitchell et al., 2021; Zintgraf et al., 2021). Many real-world applications, nevertheless, have safety constraints that should rarely be violated, which existing works do not fully address. Safe RL problems are often modeled as constrained Markov decision processes (CMDPs), where the agent aims to maximize the value function while satisfying given constraints on the trajectory (Altman, 1999). However, unlike meta-learning, CMDP algorithms are not designed to generalize efficiently over unseen tasks (Paternain et al., 2022; Ding et al., 2021a; Ding & Lavaei, 2022). In this paper, we study how meta-learning can be designed in a principled manner to help safe RL algorithms adapt quickly while satisfying safety constraints.

There are several unique challenges involved in meta-learning for the CMDP setting. First, multiple losses are incurred at each time step, i.e., the reward and the constraints, which are typically nonconvex and coupled through the dynamics. Hence, adapting existing theories developed for stylized settings such as online convex optimization (Hazan et al., 2016) is not straightforward. Second, it is unrealistic to assume the computation of a globally optimal policy for CMDPs (unlike in online learning (Hazan et al., 2016)). Thus, classical online learning algorithms that assume an exact or unbiased estimator of the loss function do not apply (Khodak et al., 2019). Overall, there is an interplay among nonconvexity, the stochastic nature of the optimization problem, and algorithmic and generalization considerations, posing significant complexity when leveraging inter-task dependency (Denevi et al., 2019). To this end, we propose a provably low-regret online learning framework that extends current meta-learning algorithms to safe RL settings.
Our main contributions are as follows:

1. Inexact CMDP-within-online framework: We propose a novel CMDP-within-online framework where the within-task problem is a CMDP and the meta-learner aims to learn the meta-initialization and learning rate. In our framework, the meta-learner only requires inexact optimal policies for each within-task CMDP and approximate state visitation distributions, estimated from collected offline trajectories, to construct upper bounds on the suboptimality gap and constraint violations. An upper bound on these estimation errors is established in Theorem 3.1.

2. Task-averaged regret in terms of empirical task-similarity: We show that the task-averaged regrets for the optimality gap (TAOG) and constraint violations (TACV) (Def. 1) diminish with respect to both the number of steps $M$ in the within-task algorithm and the number of tasks $T$. Specifically, a task-averaged regret of $O\big(\tfrac{1}{\sqrt{M}}\big(\tfrac{E_T}{\sqrt{T}} + D^{*2}\big)\big)$ holds, where $E_T$ is the total inexactness in online learning and $D^*$ is the empirical task-similarity (Theorem 3.2).

3. Adapting to a dynamic environment: We adapt the learning rates for each task to environments that entail dynamically changing meta-initialization policies. An improved rate of $O\big(\tfrac{1}{M^{3/4}}\big(\tfrac{E_T}{\sqrt{T}} + \tfrac{E_T}{T} + V_\psi^2\big)\big)$ for the TAOG and TACV is shown, where $V_\psi$ is the empirical task-relatedness with respect to a sequence of changing comparator policies $\{\psi_t^*\}_{t=1}^T$ (Corollary 1).

Incorporating all these components makes our Meta-safe RL (Meta-SRL) approach highly practical and theoretically appealing for potential adaptation to different RL settings. Furthermore, we remark on some key technical contributions that support the above developments, which may be of independent interest: 1) we study the optimization landscape of CMDPs (Theorem 3.1) in an algorithm-agnostic manner, which differs from the existing work of Mei et al. (2020) [Lemmas 3 and 15] that is restricted to the setting of policy gradient; this is achieved by developing new techniques based on tame geometry and subgradient flow systems; 2) we provide static and dynamic regret bounds for inexact online gradient descent (see Appendix E), which we leverage to obtain our final theoretical results in Theorems 3.2, 3.3, and Corollary 1. Due to space restrictions, the related work can be found in Appendix A.
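To make the inexact-online-learning ingredient concrete, below is a minimal sketch of online gradient descent where each round only supplies an estimated (possibly biased) gradient of that round's loss, as in our Appendix E analysis. All names are illustrative and the sketch is not the paper's exact algorithm.

```python
import numpy as np

def inexact_ogd(grad_estimates, x0, step_sizes, project=lambda x: x):
    """Online gradient descent with inexact per-round gradients.

    grad_estimates: list of callables; grad_estimates[t](x) approximates
        the gradient of the round-t loss at x (estimation error allowed).
    x0: initial iterate.
    step_sizes: per-round learning rates eta_t.
    project: projection onto the feasible set (identity by default).
    Returns the sequence of iterates x_1, ..., x_{T+1}.
    """
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for g, eta in zip(grad_estimates, step_sizes):
        # Descent step using the inexact gradient estimate.
        x = project(x - eta * g(x))
        iterates.append(x.copy())
    return iterates
```

For example, on quadratic losses $f_t(x) = (x - c_t)^2$ with exact gradients and step sizes $\eta_t = 1/(2t)$, the iterates track the running mean of the $c_t$'s; with bounded gradient inexactness, the regret bounds degrade gracefully with the total error, which is the role played by $E_T$ above.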

2. CMDP-WITHIN-ONLINE FRAMEWORK

In this section, we introduce the CMDP-within-online framework for the Meta-SRL problem. In this framework, a within-task algorithm (such as CRPO (Xu et al., 2021)) for each CMDP task $t \in [T]$ is encapsulated in an online learning algorithm (the meta-learning algorithm), which decides upon a sequence of initialization policies $\phi_t$ and learning rates $\alpha_t > 0$ for the within-task algorithm. The goal of the meta-learning algorithm is to minimize a notion of task-averaged performance regret, so as to facilitate provably efficient adaptation to a new task.
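The encapsulation just described can be sketched as the following schematic loop. Here `within_task_alg` stands in for any within-task CMDP solver (e.g., CRPO) and `meta_update` for the meta-learner's online update; both interfaces are illustrative assumptions, not the paper's exact implementation.

```python
def meta_srl_loop(tasks, within_task_alg, meta_update, phi0, alpha0):
    """Schematic CMDP-within-online meta-loop.

    tasks: iterable of CMDP tasks arriving online.
    within_task_alg(task, phi, alpha) -> (policy, upper_bounds):
        runs M within-task steps from initialization phi with learning
        rate alpha, returning an inexact optimal policy and estimated
        upper bounds on its optimality gap / constraint violations.
    meta_update(phi, alpha, upper_bounds) -> (phi, alpha):
        online-learning update of the meta-initialization and learning rate.
    """
    phi, alpha = phi0, alpha0
    history = []
    for task in tasks:
        policy, upper_bounds = within_task_alg(task, phi, alpha)
        history.append((policy, upper_bounds))
        # The meta-learner performs (inexact) online learning on the
        # estimated upper bounds rather than the unobservable true regrets.
        phi, alpha = meta_update(phi, alpha, upper_bounds)
    return history
```

The key design choice mirrored here is that the meta-learner never sees the true within-task regrets; it only observes estimated upper bounds, which is what makes the online learning inexact.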

2.1. CMDP AND THE PRIMAL APPROACH

Model. For each task $t \in [T]$, a CMDP $\mathcal{M}_t$ is defined by the state space $S$, the action space $A$, the discount factor $\gamma$, the initial state distribution $\rho_t$ over the state space, the transition kernel $P_t(s'|s,a): S \times A \to \Delta(S)$, the reward function $c_{t,0}: S \times A \to [0,1]$, and the cost functions $c_{t,i}: S \times A \to [0,1]$ for $i = 1, \ldots, p$. Actions are chosen according to a stochastic policy $\pi_t: S \to \Delta(A)$, where $\Delta(A)$ is the simplex over the action space; we use $\Delta(A)^{|S|}$ to denote the set of such policies over all states. The initial policy for task $t$ is denoted by $\pi_{t,0}$. The discounted state visitation distribution of a policy $\pi$ starting from state $s_0$ is defined as $\nu^{\pi}_{t,s_0}(s) := (1-\gamma)\sum_{m=0}^{\infty} \gamma^m P_t(s_m = s \mid \pi, s_0)$. We denote by $\pi_t^*$ an optimal policy for task $t$ and by $\nu_t^*(s) := \mathbb{E}_{s_0 \sim \rho_t}\big[\nu^{\pi_t^*}_{t,s_0}(s)\big]$ the corresponding state visitation distribution induced by $\pi_t^*$ when the initial state $s_0$ is sampled from $\rho_t$.
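The discounted state visitation distribution defined above admits a straightforward Monte-Carlo estimate in a tabular environment: roll out the policy and accumulate the weights $(1-\gamma)\gamma^m$ at the states visited. The sketch below assumes a hypothetical sampler interface (`step`, `policy`) and truncates the infinite sum at a finite horizon; it is an illustration of the definition, not the off-policy distribution-correction estimator used by our meta-algorithm.

```python
import numpy as np

def estimate_visitation(step, policy, s0, n_states, gamma=0.9,
                        n_rollouts=2000, horizon=200, rng=None):
    """Monte-Carlo estimate of the discounted state visitation distribution
    nu^pi_{s0}(s) = (1 - gamma) * sum_m gamma^m * P(s_m = s | pi, s0).

    step(s, a, rng) -> next state sampled from the transition kernel.
    policy(s, rng) -> action sampled from pi(.|s).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    nu = np.zeros(n_states)
    for _ in range(n_rollouts):
        s, w = s0, 1.0 - gamma          # weight (1 - gamma) * gamma^m at m = 0
        for _ in range(horizon):
            nu[s] += w
            s = step(s, policy(s, rng), rng)
            w *= gamma
    nu /= n_rollouts
    return nu                            # sums to 1 - gamma**horizon (truncation)
```

Truncation at `horizon` leaves mass $\gamma^{\text{horizon}}$ unaccounted for, which is negligible for $\gamma^{\text{horizon}} \ll 1$; averaging over rollouts handles the stochasticity of $P_t$ and $\pi$.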

