ONLINE POLICY OPTIMIZATION FOR ROBUST MDP

Abstract

Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models remains less common, as RL models can be highly sensitive to slight perturbations of the environment. The robust Markov decision process (MDP) framework, in which the transition probabilities belong to an uncertainty set around a nominal model, provides one way to develop robust models. While previous analyses show that RL algorithms are effective when a generative model is available, it remains unclear whether RL can be efficient in the more realistic online setting, which requires a careful balance between exploration and exploitation. In this work, we consider online robust MDPs in which the agent learns by interacting with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. To address the additional uncertainty caused by an adversarial environment, our model features a new optimistic update rule derived via Fenchel conjugates. Our analysis establishes the first regret bound for online robust MDPs.

1. INTRODUCTION

The rapid progress of reinforcement learning (RL) algorithms enables trained agents to navigate complicated environments and solve complex tasks. Standard reinforcement learning methods, however, may fail catastrophically in another environment, even if the two environments differ only slightly in their dynamics (Farebrother et al., 2018; Packer et al., 2018; Cobbe et al., 2019; Song et al., 2019; Raileanu & Fergus, 2021). In practical applications, such mismatches in environment dynamics are common and can arise for a number of reasons, e.g., model deviation due to incomplete data, unexpected perturbations, and possible adversarial attacks. Part of the sensitivity of standard RL algorithms stems from the formulation of the underlying Markov decision process (MDP): across a sequence of interactions, the MDP assumes the dynamics remain unchanged, and the trained agent is tested on the same dynamics thereafter.

To model the potential mismatch between system dynamics, the framework of robust MDPs was introduced to account for uncertainty in the parameters of the MDP (Satia & Lave Jr, 1973; White III & Eldeib, 1994; Nilim & El Ghaoui, 2005; Iyengar, 2005). Under this framework, the dynamics of an MDP are no longer fixed but may come from an uncertainty set, such as a rectangular uncertainty set, centered around a nominal transition kernel. The agent sequentially interacts with the nominal transition kernel to learn a policy, which is then evaluated on the worst possible transition from the uncertainty set. Therefore, instead of searching for a policy that may only perform well on the nominal transition kernel, the objective is to find the policy that performs best in the worst case. This can be viewed as a dynamic zero-sum game, where the RL agent tries to choose the best policy while nature imposes the worst possible dynamics. Intrinsically, solving robust MDPs involves solving a max-min problem, which is known to be challenging for efficient algorithm design.

More specifically, if a generative model (also known as a simulator) of the environment or a suitable offline dataset is available, one can obtain an ϵ-optimal robust policy with Õ(ϵ^{-2}) samples under a rectangular uncertainty set (Qi & Liao, 2020; Panaganti & Kalathil, 2022; Wang & Zou, 2022; Ma et al., 2022). Yet access to a generative model is a stringent requirement in real applications. In the more practical online setting, the agent sequentially interacts with the environment and faces the exploration-exploitation challenge, balancing exploration of the state space against exploitation of high-reward actions. In the robust MDP setting, previous sample complexity results do not directly carry over to this online setting, which raises the question: can an agent learn a robust policy efficiently, with provable regret guarantees, while only interacting with the nominal environment online?

In this paper, we answer the above question affirmatively and propose the first policy optimization algorithm for robust MDPs under a rectangular uncertainty set. One of the challenges in deriving a regret guarantee for robust MDPs stems from their adversarial nature. As the transition dynamics can be picked adversarially from a predefined set, the optimal policy may be randomized (Wiesemann et al., 2013). This is in contrast with conventional MDPs, where there always exists a deterministic optimal policy, which can be found with value-based methods and a greedy policy (e.g., UCB-VI algorithms). Based on this observation, we resort to policy optimization (PO)-based methods, which directly optimize a stochastic policy in an incremental way. With a stochastic policy, our algorithm explores robust MDPs in an optimistic manner.
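To make this setup concrete, the worst-case objective can be written as in the display below. This is a sketch in standard robust-MDP notation rather than the notation fixed later in the paper: P^o denotes the nominal kernel, U_ρ(P^o) an (s,a)-rectangular uncertainty set of radius ρ (a total-variation ball is used purely as an example), and V^π_P the value of policy π under kernel P.

\[
V^{\pi}_{\mathrm{rob}}(s) \;=\; \min_{P \in \mathcal{U}_{\rho}(P^{o})} V^{\pi}_{P}(s),
\qquad
\mathcal{U}_{\rho}(P^{o}) \;=\; \Big\{ P \;:\; \big\| P(\cdot \mid s,a) - P^{o}(\cdot \mid s,a) \big\|_{1} \le \rho \ \text{ for all } (s,a) \Big\},
\]

and the learner seeks a policy attaining \(\max_{\pi} \min_{P \in \mathcal{U}_{\rho}(P^{o})} V^{\pi}_{P}(s_{1})\), the max-min problem described above.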
To drive optimistic exploration in the robust setting, we propose a carefully designed bonus function via the dual conjugate of the robust Bellman equation, which quantifies both the uncertainty stemming from limited historical data and the uncertainty in the MDP dynamics. In the episodic setting of robust MDPs, we show that our algorithm attains sublinear regret O(√K) for both (s,a)- and s-rectangular uncertainty sets, where K is the number of episodes. In the case where the uncertainty set contains only the nominal transition model, our results recover the previous regret upper bound for non-robust policy optimization (Shani et al., 2020). Our result is the first provably efficient regret bound for the online robust MDP problem, as summarized in Table 1. We further validate our algorithm with experiments.
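As one illustration of the kind of conjugate-duality reformulation referred to above (shown here for a Kullback-Leibler ball; the uncertainty sets and the exact bonus used in this paper may differ), the inner worst-case expectation over next states admits a one-dimensional dual:

\[
\min_{P \,:\, D_{\mathrm{KL}}\left(P \,\|\, P^{o}(\cdot \mid s,a)\right) \le \rho} \; \mathbb{E}_{s' \sim P}\big[V(s')\big]
\;=\;
\sup_{\lambda \ge 0} \Big\{ -\lambda \log \mathbb{E}_{s' \sim P^{o}(\cdot \mid s,a)}\big[ e^{-V(s')/\lambda} \big] \;-\; \lambda \rho \Big\}.
\]

Such dual forms reduce the adversarial inner minimization to a scalar optimization over λ, which is a natural place to attach an optimism bonus for the estimated nominal kernel. Here, regret compares the learner's robust value to that of the best robust policy; in our notation, \(\mathrm{Regret}(K) = \sum_{k=1}^{K} \big( V^{\pi^{\star}}_{\mathrm{rob}}(s_{1}^{k}) - V^{\pi_{k}}_{\mathrm{rob}}(s_{1}^{k}) \big)\).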

Table 1: Comparison of previous results and ours, where S and A are the sizes of the state and action spaces, H is the horizon length, K is the number of episodes, ρ is the radius of the uncertainty set, and ϵ is the suboptimality level. We use the shorthand ι = log(SAH^2 K^{3/2} (1 + ρ)).

Algorithm          | Requires | Rectangular | Regret                              | Sample Complexity
[A] Value based    | GM       | (s,a)       | O(K^{2/3} H^{5/3} S^{2/3} A^{1/3})* | O(H^4 S^2 A / ϵ^2)
[B] Value based    | -        | (s,a)       | NA                                  | Asymptotic
[C] Policy based   | -        | (s,a)       | NA                                  | Asymptotic
[D] Value based    | GM       | (s,a)       | NA                                  | Õ(H^4 S^2 A (2+ρ)^2 / (ρ^2 ϵ^2))
[D] Value based    | GM       | s           | NA                                  | Õ(H^4 S^2 A^2 (2+ρ)^2 / (ρ^2 ϵ^2))
Ours, Policy based | -        | (s,a)       | O(S H^2 √(AK) ι)                    | O(H^4 S^2 A ι / ϵ^2)
Ours, Policy based | -        | s           | O(S A^2 H^2 √K ι)                   | O(H^4 S^2 A^4 ι / ϵ^2)

Notes: The regret bound for [A] is obtained by converting the sample complexity result of Panaganti & Kalathil (2022), and the sample complexity of our work is converted from our regret bound. "GM" denotes that a generative model is required. The superscript * marks a result obtained via batch-to-online conversion. References: [A] Panaganti & Kalathil (2022); [B] Wang & Zou (2021); [C] Badrinath & Kalathil (2021); [D] Yang et al. (2021).

2. RELATED WORK

RL with robust MDPs. Different from conventional MDPs, robust MDPs allow the transition kernel to take values from an uncertainty set. The objective in robust MDPs is to learn an optimal robust policy that maximizes the worst-case value function. When the exact uncertainty set is known, this can be solved through dynamic programming methods (Iyengar, 2005; Nilim & El Ghaoui, 2005; Mannor et al., 2012). Yet knowing the exact uncertainty set is a rather stringent requirement for most real applications. If one has access to a generative model, several model-based reinforcement learning methods are provably statistically efficient: under various characterizations of the uncertainty set, these methods enjoy a sample complexity of O(1/ϵ^2) for finding an ϵ-optimal robust policy.
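To illustrate the dynamic-programming solution available when the uncertainty set is known exactly (this is background, not the algorithm proposed in this paper), the sketch below runs finite-horizon robust value iteration with an (s,a)-rectangular total-variation ball around a nominal kernel, solving each inner worst-case expectation as a small linear program. The names P0, R, rho, H and the choice of a TV ball are assumptions made only for this illustration.

# Minimal sketch: robust value iteration with a known (s,a)-rectangular
# total-variation uncertainty set (illustrative; not this paper's algorithm).
import numpy as np
from scipy.optimize import linprog

def worst_case_expectation(p0, v, rho):
    """min_p p @ v  s.t.  p in the simplex and ||p - p0||_1 <= rho (a small LP)."""
    n = len(v)
    # Decision variables: [p (n entries), t (n entries)] with t >= |p - p0|.
    c = np.concatenate([v, np.zeros(n)])
    A_ub = np.vstack([
        np.hstack([np.eye(n), -np.eye(n)]),                   #  p - t <=  p0
        np.hstack([-np.eye(n), -np.eye(n)]),                  # -p - t <= -p0
        np.concatenate([np.zeros(n), np.ones(n)])[None, :],   #  sum(t) <= rho
    ])
    b_ub = np.concatenate([p0, -p0, [rho]])
    A_eq = np.concatenate([np.ones(n), np.zeros(n)])[None, :]  # sum(p) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(0, None)] * n)
    return res.fun

def robust_value_iteration(P0, R, rho, H):
    """P0: nominal kernel, shape (S, A, S); R: rewards, shape (S, A)."""
    S, A, _ = P0.shape
    V = np.zeros(S)
    for _ in range(H):                      # backward induction over the horizon
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + worst_case_expectation(P0[s, a], V, rho)
        V = Q.max(axis=1)                   # (s,a)-rectangularity: greedy is optimal
    return V

Replacing the TV ball with another rectangular set (e.g., a KL ball) only changes the inner worst_case_expectation subroutine; the backward induction itself is unchanged.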




