ONLINE POLICY OPTIMIZATION FOR ROBUST MDP

Abstract

Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models is less common, as RL models can be very sensitive to slight perturbations of the environment. The robust Markov decision process (MDP) framework, in which the transition probabilities belong to an uncertainty set around a nominal model, provides one way to develop robust models. While previous analysis shows RL algorithms are effective assuming access to a generative model, it remains unclear whether RL can be efficient in the more realistic online setting, which requires a careful balance between exploration and exploitation. In this work, we consider online robust MDPs where the agent interacts with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. To address the additional uncertainty caused by an adversarial environment, our model features a new optimistic update rule derived via Fenchel conjugates. Our analysis establishes the first regret bound for online robust MDPs.

1. INTRODUCTION

The rapid progress of reinforcement learning (RL) algorithms enables trained agents to navigate complicated environments and solve complex tasks. Standard reinforcement learning methods, however, may fail catastrophically in another environment, even if the two environments differ only slightly in dynamics (Farebrother et al., 2018; Packer et al., 2018; Cobbe et al., 2019; Song et al., 2019; Raileanu & Fergus, 2021). In practical applications, such mismatches of environment dynamics are common and can arise for a number of reasons, e.g., model deviation due to incomplete data, unexpected perturbations, and possible adversarial attacks. Part of the sensitivity of standard RL algorithms stems from the formulation of the underlying Markov decision process (MDP): across a sequence of interactions, the MDP assumes the dynamics remain unchanged, and the trained agent is tested on the same dynamics thereafter. To model the potential mismatch between system dynamics, the framework of robust MDPs was introduced to account for uncertainty in the parameters of the MDP (Satia & Lave Jr, 1973; White III & Eldeib, 1994; Nilim & El Ghaoui, 2005; Iyengar, 2005). Under this framework, the dynamics of an MDP are no longer fixed but can come from an uncertainty set, such as a rectangular uncertainty set, centered around a nominal transition kernel. The agent sequentially interacts with the nominal transition kernel to learn a policy, which is then evaluated on the worst possible transition from the uncertainty set. Therefore, instead of searching for a policy that may only perform well on the nominal transition kernel, the objective is to find the best-performing policy under the worst case. This can be viewed as a dynamic zero-sum game, where the RL agent tries to choose the best policy while nature imposes the worst possible dynamics. Intrinsically, solving robust MDPs involves solving a max-min problem, which is known to be challenging for efficient algorithm design.
More specifically, if a generative model (also known as a simulator) of the environment or a suitable offline dataset is available, one can obtain an $\epsilon$-optimal robust policy with $\tilde O(\epsilon^{-2})$ samples under a rectangular uncertainty set (Qi & Liao, 2020; Panaganti & Kalathil, 2022; Wang & Zou, 2022; Ma et al., 2022). Yet access to a generative model is a stringent requirement in real applications. In the more practical online setting, the agent sequentially interacts with the environment and faces the exploration-exploitation challenge as it balances between exploring the state space and exploiting high-reward actions. In the robust MDP setting, previous sample complexity results do not directly imply sublinear regret in general (Dann et al., 2017), and so far no non-asymptotic result is available. A natural question then arises: Can we design a robust RL algorithm that attains sublinear regret under robust MDPs with rectangular uncertainty sets?

In this paper, we answer the above question affirmatively and propose the first policy optimization algorithm for robust MDPs under a rectangular uncertainty set. One of the challenges in deriving a regret guarantee for robust MDPs stems from their adversarial nature. As the transition dynamics can be picked adversarially from a predefined set, the optimal policy may be randomized (Wiesemann et al., 2013). This is in contrast with conventional MDPs, where there always exists a deterministic optimal policy, which can be found with value-based methods and a greedy policy (e.g., UCB-VI algorithms). Bearing this observation, we resort to policy optimization (PO) methods, which directly optimize a stochastic policy in an incremental way. With a stochastic policy, our algorithm explores robust MDPs in an optimistic manner. To achieve this robustly, we propose a carefully designed bonus function via the Fenchel conjugate of the robust Bellman equation.
This quantifies both the uncertainty stemming from the limited historical data and the uncertainty of the MDP dynamics. In the episodic setting of robust MDPs, we show that our algorithm attains sublinear regret $O(\sqrt{K})$ for both $(s,a)$- and $s$-rectangular uncertainty sets, where $K$ is the number of episodes. In the case where the uncertainty set contains only the nominal transition model, our results recover the previous regret upper bound of non-robust policy optimization (Shani et al., 2020). Our result achieves the first provably efficient regret bound for the online robust MDP problem, as shown in Table 1. We further validate our algorithm with experiments.

Table 1: Comparison of previous results and ours, where $S, A$ are the sizes of the state and action spaces, $H$ is the length of the horizon, $K$ is the number of episodes, $\rho$ is the radius of the uncertainty set, and $\epsilon$ is the level of suboptimality. We shorthand $\iota = \log(SAH^2K^{3/2}(1+\rho))$. The regret upper bound for Panaganti & Kalathil (2022) is obtained by converting their sample complexity result, and the sample complexity for our work is converted from our regret bound. We use "GM" to denote the requirement of a generative model. The superscript * marks results obtained via batch-to-online conversion. References: [A] Panaganti & Kalathil (2022); [B] Wang & Zou (2021); [C] Badrinath & Kalathil (2021); [D] Yang et al. (2021).

| Algorithm | Requires | Rectangular | Regret | Sample Complexity |
|---|---|---|---|---|
| [A] Value-based | GM | $(s,a)$ | $O(K^{2/3}H^{5/3}S^{2/3}A^{1/3})^*$ | $O(H^4S^2A/\epsilon^2)$ |
| [B] Value-based | - | $(s,a)$ | NA | Asymptotic |
| [C] Policy-based | - | $(s,a)$ | NA | Asymptotic |
| [D] Value-based | GM | $(s,a)$ | NA | $\tilde O\big(H^4S^2A(2+\rho)^2/(\rho^2\epsilon^2)\big)$ |
| [D] Value-based | GM | $s$ | NA | $\tilde O\big(H^4S^2A^2(2+\rho)^2/(\rho^2\epsilon^2)\big)$ |
| Ours, Policy-based | - | $(s,a)$ | $O(SH^2\sqrt{AK\iota})$ | $O(H^4S^2A\iota/\epsilon^2)$ |
| Ours, Policy-based | - | $s$ | $O(SA^2H^2\sqrt{K\iota})$ | $O(H^4S^2A^4\iota/\epsilon^2)$ |

2. RELATED WORK

RL with robust MDPs. Different from conventional MDPs, robust MDPs allow the transition kernel to take values from an uncertainty set. The objective in robust MDPs is to learn an optimal robust policy that maximizes the worst-case value function. When the exact uncertainty set is known, this can be solved through dynamic programming methods (Iyengar, 2005; Nilim & El Ghaoui, 2005; Mannor et al., 2012). Yet knowing the exact uncertainty set is a rather stringent requirement for most real applications. If one has access to a generative model, several model-based reinforcement learning methods are provably statistically efficient. With different characterizations of the uncertainty set, these methods enjoy a sample complexity of $O(1/\epsilon^2)$ for an $\epsilon$-optimal robust value function (Panaganti & Kalathil, 2022; Yang et al., 2021). Similar results can also be achieved if an offline dataset is present, for which previous works (Qi & Liao, 2020; Zhou et al., 2021; Kallus et al., 2022; Ma et al., 2022) show an $O(1/\epsilon^2)$ sample complexity for an $\epsilon$-optimal policy. In addition, Liu et al. (2022) proposed distributionally robust Q-learning, which solves for the asymptotically optimal Q-function. In the case of online RL, the only available results are asymptotic. For discounted MDPs, Wang & Zou (2021) and Badrinath & Kalathil (2021) study the policy gradient method and show an $O(\epsilon^{-3})$ convergence rate for an alternative learning objective (a smoothed variant), which is equivalent to the original policy gradient objective only in an asymptotic regime. These sample complexity and asymptotic results in general cannot imply sublinear regret in robust MDPs (Dann et al., 2017).

RL with adversarial MDPs. Another line of work characterizes the uncertainty of the environment through the adversarial MDP formulation, where the environmental parameters can be chosen adversarially without restrictions.
Obtaining low regret in this setting is proven to be NP-hard (Even-Dar et al., 2004). Several works study the variant where the adversary can only modify the reward function, while the transition dynamics of the MDP remain unchanged. In this case, it is possible to obtain policy-based algorithms that are efficient with sublinear regret (Rosenberg & Mansour, 2019; Jin & Luo, 2020; Jin et al., 2020; Shani et al., 2020; Cai et al., 2020). In a separate vein, several works investigate the setting where the transitions are adversarially chosen for only $C$ out of the $K$ total episodes; a regret of $O(C^2 + \sqrt{K})$ is established thereafter (Lykouris et al., 2021; Chen et al., 2021b; Zhang et al., 2022).

Non-robust policy optimization

The problem of policy optimization has been extensively investigated under non-robust MDPs (Neu et al., 2010; Cai et al., 2020; Shani et al., 2020; Wu et al., 2022; Chen et al., 2021a). The proposed methods are proven to achieve sublinear regret. These methods are also closely related to empirically successful policy optimization algorithms in RL, such as PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015).

3. ROBUST MDP AND UNCERTAINTY SETS

In this section, we describe the formal setup of robust MDPs. We start by defining some notation.

Robust Markov decision process

We consider an episodic finite-horizon robust MDP, denoted by a tuple $M = \langle S, A, H, \{\mathcal{P}_h\}_{h=1}^H, \{r_h\}_{h=1}^H\rangle$. Here $S$ is the state space, $A$ is the action space, $\{r_h\}_{h=1}^H$ are the time-dependent reward functions, and $H$ is the length of each episode. Instead of being governed by a fixed transition kernel, the transitions of the robust MDP are governed by kernels within a time-dependent uncertainty set $\{\mathcal{P}_h\}_{h=1}^H$, i.e., the transition kernel at time $h$ satisfies $P_h \in \mathcal{P}_h$. The uncertainty set $\mathcal{P}_h$ is constructed around a nominal transition kernel $P^o_h$, and all transition kernels within the set are close to the nominal kernel under a distance metric of one's choice. Different from an episodic finite-horizon non-robust MDP, the transition kernel $P_h$ may not only be time-dependent but may also be chosen (even adversarially) from the specified time-dependent uncertainty set $\mathcal{P}_h$. We consider the case where the rewards are stochastic. That is, on a state-action pair $(s,a)$ at time $h$, the immediate reward is $R_h(s,a) \in [0,1]$, drawn i.i.d. from a distribution with expectation $r_h(s,a)$. With the described setup of robust MDPs, we now define the policy and its associated value.

Policy and robust value function. A time-dependent policy $\pi$ is defined as $\pi = \{\pi_h\}_{h=1}^H$, where each $\pi_h$ is a function from $S$ to the probability simplex over actions, $\Delta(A)$. If the transition kernel is fixed to be $P$, the performance of a policy $\pi$ starting from state $s$ at time $h$ can be measured by its value function, defined as $V^{\pi,P}_h(s) = \mathbb{E}_{\pi,P}\big[\sum_{h'=h}^H r_{h'}(s_{h'},a_{h'}) \mid s_h = s\big]$. In robust MDPs, the robust value function instead measures the performance of $\pi$ under the worst possible choice of transition kernels within the uncertainty set.
Specifically, the value and Q-value functions of a policy at step $h$ are defined as
$V^\pi_h(s) = \min_{\{P_h\}\in\{\mathcal{P}_h\}} V^{\pi,\{P_h\}}_h(s)$ and $Q^\pi_h(s,a) = \min_{\{P_h\}\in\{\mathcal{P}_h\}} \mathbb{E}_{\pi,\{P_h\}}\big[\sum_{h'=h}^H r_{h'}(s_{h'},a_{h'}) \mid (s_h,a_h) = (s,a)\big]$.
The optimal value function is defined as the best possible value attained by a policy, $V^*_h(s) = \max_\pi V^\pi_h(s) = \max_\pi \min_{\{P_h\}\in\{\mathcal{P}_h\}} V^{\pi,\{P_h\}}_h(s)$. The optimal policy is then defined as the policy that attains the optimal value.

Robust Bellman equation. Similar to non-robust MDPs, robust MDPs satisfy the following robust Bellman equation, which characterizes a recursion for the robust value function (Ho et al., 2021; Yang et al., 2021):
$Q^\pi_h(s,a) = r_h(s,a) + \sigma_{\mathcal{P}_h}(V^\pi_{h+1})(s,a)$, $\quad V^\pi_h(s) = \langle Q^\pi_h(s,\cdot), \pi_h(\cdot \mid s)\rangle$,
where $\sigma_{\mathcal{P}_h}(V^\pi_{h+1})(s,a) = \min_{P_h\in\mathcal{P}_h} P_h(\cdot\mid s,a)\,V^\pi_{h+1}$ and $P_h(\cdot\mid s,a)\,V = \sum_{s'\in S} P_h(s'\mid s,a)V(s')$. Without additional assumptions on the uncertainty set, the optimal policy and value of a robust MDP are in general NP-hard to compute (Wiesemann et al., 2013). One of the most common assumptions that makes solving for the optimal value feasible is the rectangularity assumption (Iyengar, 2005; Wiesemann et al., 2013; Badrinath & Kalathil, 2021; Yang et al., 2021; Panaganti & Kalathil, 2022).
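For concreteness, when the uncertainty set is an $\ell_1$ ball of radius $\rho$ around the nominal kernel (as in the rectangular sets defined below), the inner minimization defining $\sigma_{\mathcal{P}_h}$ has a simple direct solution: the worst-case kernel moves up to $\rho/2$ probability mass from the highest-value next states onto the lowest-value one. A minimal numerical sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def robust_expectation_l1(p_nominal, v, rho):
    """Worst-case expectation min_p <p, v> over the set
    {p in simplex : ||p - p_nominal||_1 <= rho}.

    The minimizer adds mass to the lowest-value state and removes
    the same amount from the highest-value states, greedily.
    """
    p = np.asarray(p_nominal, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    order = np.argsort(v)                        # states sorted by increasing value
    lowest = order[0]
    budget = min(rho / 2.0, 1.0 - p[lowest])     # mass we may add to the worst state
    p[lowest] += budget
    for i in order[::-1]:                        # remove mass from high-value states first
        if i == lowest or budget <= 0:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    return float(p @ v), p
```

For example, with nominal kernel $(0.5, 0.3, 0.2)$, values $(1, 2, 3)$, and $\rho = 0.2$, the worst case shifts $0.1$ mass from the best next state to the worst, lowering the expected value from $1.7$ to $1.5$.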

Rectangular uncertainty sets

To limit the level of perturbation, we assume that the transition kernels are close to the nominal transition as measured by the $\ell_1$ distance. We consider two cases. The $(s,a)$-rectangularity assumption posits that the transition kernels within the set take values independently for each $(s,a)$. We use the $\ell_1$ distance to characterize the $(s,a)$-rectangular set around a nominal kernel with a specified level of uncertainty.

Definition 3.1 ($(s,a)$-rectangular uncertainty set; Iyengar (2005); Wiesemann et al. (2013)). For each time step $h$ and state-action pair $(s,a)$, the $(s,a)$-rectangular uncertainty set $\mathcal{P}_h(s,a)$ is defined as
$\mathcal{P}_h(s,a) = \{P_h(\cdot\mid s,a) \in \Delta(S) : \|P_h(\cdot\mid s,a) - P^o_h(\cdot\mid s,a)\|_1 \le \rho\}$,
where $P^o_h$ is the nominal transition kernel at step $h$ with $P^o_h(\cdot\mid s,a) > 0$ for all $(s,a) \in S\times A$, $\rho$ is the level of uncertainty, and $\Delta(S)$ denotes the probability simplex over the state space $S$.

With the $(s,a)$-rectangular set, it is known that there always exists an optimal policy that is deterministic (Wiesemann et al., 2013). One way to relax the $(s,a)$-rectangularity assumption is to instead let the transition kernels within the set take values independently for each $s$ only. This characterization is more general, and its solution gives a stronger robustness guarantee.

Definition 3.2 ($s$-rectangular uncertainty set; Wiesemann et al. (2013)). For each time step $h$ and state $s$, the $s$-rectangular uncertainty set $\mathcal{P}_h(s)$ is defined as
$\mathcal{P}_h(s) = \big\{P_h(\cdot\mid s,\cdot) \in \Delta(S)^A : \sum_{a\in A}\|P_h(\cdot\mid s,a) - P^o_h(\cdot\mid s,a)\|_1 \le A\rho\big\}$,
where $P^o_h$ is the nominal transition kernel at step $h$ with $P^o_h(\cdot\mid s,a) > 0$ for all $(s,a)\in S\times A$, $\rho$ is the level of uncertainty, and $\Delta(S)$ denotes the probability simplex over the state space $S$.

Different from the $(s,a)$-rectangularity assumption, which guarantees the existence of a deterministic optimal policy, the optimal policy under an $s$-rectangular set may need to be randomized (Wiesemann et al., 2013).
We also remark that the requirement $P^o_h(\cdot\mid s,a) > 0$ is mostly for technical convenience. Equipped with the characterization of the uncertainty sets, we now describe the learning protocol and the definition of regret under the robust MDP.

Learning protocol and regret. We consider a learning agent that repeatedly interacts with the environment in an episodic manner, over $K$ episodes. At the start of each episode, the learning agent picks a policy $\pi_k$ and interacts with the environment while executing $\pi_k$. Without loss of generality, we assume the agent always starts from a fixed initial state. The performance of the learning agent is measured by the cumulative regret incurred over the $K$ episodes. Under the robust MDP, the cumulative regret is defined as the cumulative difference between the robust value of the optimal policy and the robust value of $\pi_k$,
$\mathrm{Regret}(K) = \sum_{k=1}^K \big(V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1)\big)$,
where $s^k_1$ is the initial state of episode $k$. We highlight that the transitions of the states in the learning process are specified by the nominal transition kernels $\{P^o_h\}_{h=1}^H$, though the agent only has access to the nominal kernel in an online manner. We remark that if the agent instead interacted with a potentially adversarially chosen transition from an arbitrary uncertainty set, the learning problem would be NP-hard (Even-Dar et al., 2004). One practical motivation for this formulation is the following: a policy provider only sees feedback from the nominal system, yet she aims to minimize the regret for clients who refuse to share additional deployment details for privacy reasons.

4. ALGORITHM

Before we introduce our algorithm, we first illustrate the importance of taking uncertainty into consideration. Given a robust MDP, one of the most naive approaches is to directly train a policy on the nominal transition model. However, the following claim shows that an optimal policy under the nominal transition can be arbitrarily bad under the worst-case transition (even worse than a random policy).

Claim 4.1 (Suboptimality of the non-robust optimal policy). There exists a robust MDP $M = \langle S, A, \mathcal{P}, r, H\rangle$ with uncertainty set $\mathcal{P}$ of uncertainty radius $\rho$ such that the non-robust optimal policy is $\Omega(1)$-suboptimal relative to the uniformly random policy.

The proof of Claim 4.1 is deferred to Appendix D. This result implies that a policy obtained by non-robust RL algorithms can perform arbitrarily badly when the dynamics deviate from the nominal transition. We therefore present the following robust optimistic policy optimization algorithm (Algorithm 1) to avoid this undesired outcome.
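To make Claim 4.1 concrete, here is one illustrative construction (ours, not necessarily the one in Appendix D): at every step a "risky" action pays reward 1 but lets the agent survive only with worst-case probability $1-\rho/2$, while a "safe" action pays $0.5$ and always survives; a dead agent collects nothing. The nominal-optimal policy always plays risky (nominal value $H$), yet its robust value decays geometrically in the horizon and falls below that of the uniformly random policy. All constants below are hypothetical:

```python
H, rho = 20, 0.4
q = 1 - rho / 2   # worst-case survival probability of the risky action

def robust_value(p_risky):
    """Robust value of the stationary policy playing 'risky' w.p. p_risky.
    Since every policy's value is increasing in the survival probability,
    the worst-case kernel in this construction sets it to q at every step."""
    surv = p_risky * q + (1 - p_risky)        # per-step survival probability
    r = p_risky * 1.0 + (1 - p_risky) * 0.5   # per-step expected reward while alive
    alive, total = 1.0, 0.0
    for _ in range(H):
        total += alive * r
        alive *= surv
    return total

v_nominal_opt = robust_value(1.0)   # always risky: nominal-optimal, robust value ~ 4.94
v_random = robust_value(0.5)        # uniformly random: robust value ~ 6.59
v_safe = robust_value(0.0)          # always safe: robust value 10.0
```

With $\rho = 0.4$ and $H = 20$, the nominal-optimal policy is already worse than the random policy by a constant gap, and the gap to the robust-optimal (safe) policy is even larger.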

4.1. ROBUST OPTIMISTIC POLICY OPTIMIZATION

Given the presence of an uncertainty set, the optimal policies may all be randomized (Wiesemann et al., 2013). In such cases, value-based methods may be insufficient, as they usually rely on a deterministic policy. We thus resort to optimistic policy optimization methods (Shani et al., 2020), which directly learn a stochastic policy. Our algorithm performs policy optimization with empirical estimates and encourages exploration by adding a bonus to less-explored states. However, we need a new, efficiently computable bonus that is robust to adversarial transitions. We achieve this by solving a sub-optimization problem derived via the Fenchel conjugate. We present Robust Optimistic Policy Optimization (ROPO) in Algorithm 1 and elaborate on its design components below.

To start, as our algorithm has no access to the actual reward and transition functions, we use the following empirical estimators of the reward and transition:
$\hat r^k_h(s,a) = \frac{\sum_{k'=1}^{k-1} R^{k'}_h(s,a)\,\mathbb{I}\{s^{k'}_h = s,\ a^{k'}_h = a\}}{N^k_h(s,a)}$, $\quad \hat P^{o,k}_h(s,a,s') = \frac{\sum_{k'=1}^{k-1} \mathbb{I}\{s^{k'}_h = s,\ a^{k'}_h = a,\ s^{k'}_{h+1} = s'\}}{N^k_h(s,a)}$, (2)
where $N^k_h(s,a) = \max\big\{\sum_{k'=1}^{k-1}\mathbb{I}\{s^{k'}_h = s,\ a^{k'}_h = a\},\ 1\big\}$ and the bonus is of order $b^k_h(s,a) = O\big(N^k_h(s,a)^{-1/2}\big)$. The optimistic Q-value and value estimates are
$\hat Q^k_h(s,a) = \min\big\{\hat r^k_h(s,a) + \sigma_{\hat{\mathcal{P}}_h}(\hat V^{\pi}_{h+1})(s,a) + b^k_h(s,a),\ H\big\}$, $\quad \hat V^k_h(s) = \langle \hat Q^k_h(s,\cdot),\ \pi^k_h(\cdot\mid s)\rangle$.
Intuitively, the bonus term $b^k_h$ is meant to provide the optimism required for efficient exploration, covering both the estimation error of $P$ and the robustness with respect to $\mathcal{P}$. These two quantities are hard to control in their primal form because of the coupling between them; we propose the following procedure to address this. Note that the key difference between our algorithm and standard policy optimization is that $\sigma_{\hat{\mathcal{P}}_h}(\hat V^{\pi}_{h+1})(s,a)$ requires solving an inner minimization problem (1).
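The count-based estimators in Equation (2) can be sketched as follows (the class, array shapes, and constants are our own illustration; the constants in the bonus are omitted):

```python
import numpy as np

S, A, H = 5, 3, 4  # illustrative sizes

class EmpiricalModel:
    """Count-based estimates of the reward and nominal kernel, Equation (2)-style."""
    def __init__(self):
        self.n = np.zeros((H, S, A))          # visit counts
        self.r_sum = np.zeros((H, S, A))      # accumulated rewards
        self.p_cnt = np.zeros((H, S, A, S))   # transition counts

    def update(self, trajectory):
        # trajectory: list of (h, s, a, reward, s_next) tuples
        for h, s, a, r, s2 in trajectory:
            self.n[h, s, a] += 1
            self.r_sum[h, s, a] += r
            self.p_cnt[h, s, a, s2] += 1

    def estimates(self):
        n = np.maximum(self.n, 1)             # N = max(count, 1)
        r_hat = self.r_sum / n
        p_hat = self.p_cnt / n[..., None]
        bonus = n ** -0.5                     # b ~ O(N^{-1/2}), constants omitted
        return r_hat, p_hat, bonus
```

Unvisited pairs keep $N = 1$, so their bonus stays at its maximum and the optimistic Q-value steers exploration toward them.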
By relaxing the constraints with Lagrange multipliers and Fenchel conjugates, under the $(s,a)$-rectangular set the inner minimization problem can be reduced to a one-dimensional unconstrained convex optimization problem on $\mathbb{R}$ (Lemma 4):
$\sup_{\eta\in\mathbb{R}} \Big[\eta - \frac{\rho}{2}\big(\eta - \min_{s'} \hat V^{\pi_k}_{h+1}(s')\big)_+ - \sum_{s'} P^o_h(s'\mid s,a)\big(\eta - \hat V^{\pi_k}_{h+1}(s')\big)_+\Big]$. (3)
The optimum of Equation (3) can then be computed efficiently with bisection or sub-gradient methods. We note that while this dual form has been used before in the presence of a generative model or an offline dataset (Badrinath & Kalathil, 2021; Panaganti & Kalathil, 2022; Yang et al., 2021), it was previously unclear whether it is effective in the online setting. Similarly, in the case of the $s$-rectangular set, the inner minimization problem is equivalent to an $A$-dimensional convex optimization problem:
$\sup_{\eta\in\mathbb{R}^A} \Big[\sum_{a'} \eta_{a'} - \sum_{s',a'} P^o_h(s'\mid s,a')\big(\eta_{a'} - \mathbb{I}\{a'=a\}\hat V^{\pi_k}_{h+1}(s')\big)_+ - \frac{A\rho}{2}\min_{s',a'}\big(\eta_{a'} - \mathbb{I}\{a'=a\}\hat V^{\pi_k}_{h+1}(s')\big)_+\Big]$, (4)
where $a \sim \pi_k(\cdot\mid s)$. In addition to reducing the computational complexity, the dual forms (Equations (3) and (4)) decouple the uncertainty from estimation error and the uncertainty from robustness, as $\rho$ and $P^o_h$ appear in different terms. The exact form of $b^k_h$ is presented in Equations (5) and (6).
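For the $(s,a)$-rectangular case, the one-dimensional dual objective is concave and piecewise linear in $\eta$, so its optimum is attained at one of the kinks $\eta \in \{\hat V(s')\}$; this makes it easy to evaluate exactly and to cross-check against a direct greedy solution of the primal inner minimization. A hedged sketch (function names are ours; the $\rho/2$ coefficient follows the $\ell_1$ duality used here):

```python
import numpy as np

def dual_sigma_l1(p_nominal, v, rho):
    """sigma_P(V)(s,a) via the 1-D dual:
       sup_eta  eta - (rho/2)(eta - min V)_+ - sum_s' P^o(s')(eta - V(s'))_+
    Concave piecewise-linear in eta, so a kink eta in {V(s')} attains the sup."""
    p0, v = np.asarray(p_nominal, float), np.asarray(v, float)

    def g(eta):
        return (eta
                - (rho / 2) * max(eta - v.min(), 0.0)
                - float(np.sum(p0 * np.maximum(eta - v, 0.0))))

    return max(g(eta) for eta in v)

def primal_sigma_l1(p_nominal, v, rho):
    """Reference: direct greedy solution of min_p <p, v> over the l1 ball."""
    p, v = np.asarray(p_nominal, float).copy(), np.asarray(v, float)
    order = np.argsort(v)
    budget = min(rho / 2, 1.0 - p[order[0]])
    p[order[0]] += budget
    for i in order[::-1]:
        if i == order[0]:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    return float(p @ v)
```

On random instances the two routes agree to numerical precision, which is a useful sanity check when implementing the bonus.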

Policy Improvement

Using the optimistic Q-values obtained from policy evaluation, the algorithm improves the policy with a KL-regularized online mirror descent step,
$\pi^{k+1}_h \in \arg\max_\pi\ \beta\langle \nabla \hat V^{\pi_k}_h, \pi - \pi^k_h\rangle - D_{\mathrm{KL}}(\pi \,\|\, \pi^k_h)$,
where $\beta$ is the learning rate. Equivalently, the updated policy is given by the closed-form solution
$\pi^{k+1}_h(a\mid s) = \frac{\pi^k_h(a\mid s)\exp\big(\beta \hat Q^{\pi_k}_h(s,a)\big)}{\sum_{a'} \pi^k_h(a'\mid s)\exp\big(\beta \hat Q^{\pi_k}_h(s,a')\big)}$.
An important ingredient of the policy improvement analysis is a fundamental inequality of online mirror descent (Inequality (7)), as presented in Shani et al. (2020). We suspect that other online algorithms with sublinear regret could also be used for policy improvement. In the non-robust case, this improvement step has also been shown to be theoretically efficient (Shani et al., 2020; Wu et al., 2022). Many empirically successful policy optimization algorithms, such as PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015), take a similar approach via KL regularization for non-robust policy improvement. Putting everything together, the proposed algorithm is summarized in Algorithm 1.
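The closed-form mirror descent update is a per-state softmax reweighting of the current policy. A minimal sketch (the array shapes and function name are our choice):

```python
import numpy as np

def mirror_descent_step(pi_h, q_hat, beta):
    """KL-regularized closed-form update:
       pi_{k+1}(a|s)  proportional to  pi_k(a|s) * exp(beta * Q_hat(s, a)).
    pi_h, q_hat: arrays of shape (num_states, num_actions)."""
    logits = np.log(pi_h) + beta * q_hat
    logits -= logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

With $\beta = 0$ the policy is unchanged, and as $\beta$ grows the update concentrates on the actions with the largest optimistic Q-values, matching the usual mirror descent intuition.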

5. THEORETICAL RESULTS

We are now ready to present the theoretical guarantees of our algorithm under the uncertainty sets introduced above.

5.1. RESULTS UNDER (s, a)-RECTANGULAR UNCERTAINTY SET

Equipped with Algorithm 1 and the bonus function described in Equation (5), we obtain the regret upper bound under the $(s,a)$-rectangular uncertainty set, stated in the following theorem.

Theorem 1 (Regret under the $(s,a)$-rectangular uncertainty set). With learning rate $\beta = \sqrt{\frac{2\log A}{H^2 K}}$ and bonus term $b^k_h$ as in (5), with probability at least $1-\delta$, the regret incurred by Algorithm 1 over $K$ episodes is bounded by
$\mathrm{Regret}(K) = O\Big(H^2 S\sqrt{AK \log\big(SAH^2K^{3/2}(1+\rho)/\delta\big)}\Big)$.

Remark 5.1. When $\rho = 0$, the problem reduces to non-robust reinforcement learning. In this case, our regret upper bound becomes $\tilde O(H^2 S\sqrt{AK})$, which matches the order of policy optimization algorithms in the non-robust case (Shani et al., 2020).

While we defer the detailed proof to Appendix A, we sketch it and highlight its challenges below. First, unlike policy optimization for non-robust MDPs, classic tools such as the value difference lemma (Shani et al., 2020) can no longer be applied, because the adversarial transition kernel is policy-dependent. Naively employing a recursive relation with respect to a fixed transition kernel, in a similar way to the value difference lemma, may lead to linear regret. To address this issue, we propose the following decomposition:
$V^*_h(s) - V^{\pi_k}_h(s) \le \mathbb{E}_{\pi^*}\big[(r_h(s,a) - \hat r^k_h(s,a)) + \big(\sigma_{\mathcal{P}_h(s,a)}(\hat V^{\pi_k}_{h+1})(s,a) - \sigma_{\hat{\mathcal{P}}_h(s,a)}(\hat V^{\pi_k}_{h+1})(s,a)\big) - b^k_h(s,a)\big] + \mathbb{E}_{\pi^*}\big[\sigma_{\mathcal{P}_h(s,a)}(V^*_{h+1})(s,a) - \sigma_{\mathcal{P}_h(s,a)}(\hat V^{\pi_k}_{h+1})(s,a)\big] + \langle \hat Q^{\pi_k}_h(s,\cdot),\ \pi^*(\cdot\mid s) - \pi_k(\cdot\mid s)\rangle$.
In particular, we perform a recursion conditioned on the varying transition kernel $p_h(\cdot\mid s,a) = \arg\max_{P_h\in\mathcal{P}_h} P_h(\cdot\mid s,a)\big(\hat V^{\pi_k}_{h+1} - V^{\pi_k}_{h+1}\big)$. However, this introduces another problem: maintaining optimism is hard, as the expectation at each time step $h$ is taken with respect to a different transition kernel.
To establish an optimism bonus covering both the uncertainty of the transition caused by limited interaction and the uncertainty set itself, we derive the dual formulation of the inner optimization problem $\sigma_{\mathcal{P}(s,a)}(V)$ (Equation (3)). This allows us to decouple the two sources of uncertainty and bound each separately. Notice that the difference $\sigma_{\mathcal{P}(s,a)}(V) - \sigma_{\hat{\mathcal{P}}(s,a)}(V)$ is now incurred only by the difference in $\sum_{s'} P^o_h(s'\mid s,a)\big(\eta - \hat V^{\pi_k}_{h+1}(s')\big)_+$. We then show that $\eta$ must be bounded at its optimum by inspecting certain pivot points and using the convexity of the dual. With the desired bounds on $\eta$, applying Hoeffding's inequality with an $\epsilon$-net argument yields the claimed regret bound. Our algorithm and analysis techniques also extend to other uncertainty sets, such as KL-divergence-constrained uncertainty sets. We include the KL divergence result in Appendix C.

5.2. RESULTS UNDER s-RECTANGULAR UNCERTAINTY SET

Beyond the $(s,a)$-rectangular uncertainty set, we also extend our results to the $s$-rectangular uncertainty set (Definition 3.2). Recall that value-based methods do not extend to the $s$-rectangular uncertainty set, as there might not exist a deterministic optimal policy.

Theorem 2 (Regret under the $s$-rectangular uncertainty set). With learning rate $\beta = \sqrt{\frac{2\log A}{H^2 K}}$ and bonus term $b^k_h$ as in (6), with probability at least $1-\delta$, the regret of Algorithm 1 is bounded by
$\mathrm{Regret}(K) = O\Big(SA^2H^2\sqrt{K\log\big(SA^2H^2K^{3/2}(1+\rho)/\delta\big)}\Big)$.

Remark 5.2. When $\rho = 0$, the problem reduces to non-robust reinforcement learning, in which case our regret upper bound is $\tilde O(SA^2H^2\sqrt{K})$.

Our result is the first theoretical guarantee for learning a robust policy under the $s$-rectangular uncertainty set, as previous results only learn the robust value function (Yang et al., 2021). The analysis and techniques used for Theorem 2 closely parallel those used for Theorem 1. The main difference is in bounding $\sigma_{\hat{\mathcal{P}}_h(s)}(\hat V^{\pi_k}_{h+1})(s,a) - \sigma_{\mathcal{P}_h(s)}(\hat V^{\pi_k}_{h+1})(s,a)$. We defer the detailed proof to Appendix B.

6. EMPIRICAL RESULTS

To validate our theoretical findings, we conduct a preliminary empirical analysis of our proposed robust policy optimization algorithm.

Environment. We conduct the experiments in the Gridworld environment, a classic reinforcement learning example from Sutton & Barto (2018). The environment is a two-dimensional grid of cells. Specifically, it is a 5 × 5 grid, where the agent starts from the upper-left cell. Each cell is one of three types: road (labeled o), wall (labeled x), or reward (labeled +). The agent can walk through road cells but not wall cells. Once the agent steps on a reward cell, it receives a reward of 1; otherwise it receives no reward. The goal of the agent is to collect as many rewards as possible within the allowed time. The agent has four actions at each step: up, down, left, and right. After taking an action, the agent moves in the desired direction with success probability $p$, and moves in one of the other directions with the remaining probability.

Experiment configurations. To simulate the robust MDP, we create a nominal transition dynamic with success probability $p = 0.9$. The learning agent interacts with this nominal transition during training and with a perturbed transition dynamic during evaluation. Under the $(s,a)$-rectangular set, the transitions are perturbed against the direction the agent is heading, subject to a constraint of $\rho$. Under the $s$-rectangular set, the transitions are perturbed against the direction of the goal state. Figure 1 shows an example of our environment, where the perturbation causes some policies that are optimal under the nominal transitions to become suboptimal under the perturbed transitions. We refer to the perturbed transitions as robust transitions in our results. We implement our proposed robust policy optimization algorithm along with its non-robust variant (Shani et al., 2020).
The inner minimization in Algorithm 1 is computed through its dual formulation for efficiency. Our algorithm is implemented with the rlberry framework (Domingues et al., 2021).

Results

We present results with $\rho = 0.1, 0.2, 0.3$ under the $(s,a)$-rectangular set in Figure 2; the results with $s$-rectangular sets are included in the appendix. We report the averaged cumulative rewards during evaluation. Regardless of the level of uncertainty, we observe that the robust variant of the policy optimization algorithm is more robust to dynamics changes, as it obtains a higher level of rewards than its non-robust variant.

7. CONCLUSION AND FUTURE DIRECTIONS

In this paper, we studied the problem of regret minimization in robust MDPs with rectangular uncertainty sets. We proposed a robust variant of optimistic policy optimization, which achieves sublinear regret in all uncertainty sets considered. Our algorithm delicately balances the exploration-exploitation trade-off through a carefully designed bonus term, which quantifies not only the uncertainty due to limited observations but also the uncertainty of robust MDPs. Our results are the first regret upper bounds for robust MDPs, as well as the first non-asymptotic results for robust MDPs without access to a generative model. As for future work, while our analysis achieves the same bound as the policy optimization algorithm of Shani et al. (2020) when the robustness level $\rho = 0$, we suspect some technical details could be improved. For example, we required $P^o_h$ to be positive for every $(s,a)$ so that we could perform a change of variables to form an efficiently solvable Fenchel dual. However, the actual positive value cancels out later and does not appear in the bound, suggesting that the strict-positivity assumption might be an artifact of the analysis. Furthermore, our work could be extended in several directions. One is to consider other characterizations of uncertainty sets, such as the Wasserstein distance metric. Another is to extend robust MDPs to a wider family of MDPs, such as MDPs with infinitely many states and function approximation.



[A]: Panaganti & Kalathil (2022), [B]: Wang & Zou (2021), [C]: Badrinath & Kalathil (2021), [D]: Yang et al. (2021).

$N^k_h(s,a)$ counts the number of visits to $(s,a)$.

Optimistic robust policy evaluation. In each episode, the algorithm estimates Q-values with an optimistic variant of the Bellman equation. Specifically, to encourage exploration in the robust MDP, we add a bonus term $b^k_h(s,a)$, which compensates for the lack of knowledge of the actual reward and transition model as well as of the uncertainty set, with order $b^k_h(s,a) = O\big(N^k_h(s,a)^{-1/2}\big)$.

Algorithm 1: Robust Optimistic Policy Optimization (ROPO)
Input: learning rate $\beta$, bonus function $b^k_h$.
for $k = 1, \dots, K$ do
    Collect a trajectory of samples by executing $\pi_k$.
    # Robust Policy Evaluation
    for $h = H, \dots, 1$ do
        for all $(s,a) \in S \times A$ do
            Solve $\sigma_{\hat{\mathcal{P}}_h}(\hat V^{\pi}_{h+1})(s,a)$ according to Equation (3) for the $(s,a)$-rectangular set or Equation (4) for the $s$-rectangular set.
            $\hat Q^k_h(s,a) = \min\{\hat r(s,a) + \sigma_{\hat{\mathcal{P}}_h}(\hat V^{\pi}_{h+1})(s,a) + b^k_h(s,a),\ H\}$.
        end for
        for all $s \in S$ do
            $\hat V^k_h(s) = \langle \hat Q^k_h(s,\cdot),\ \pi^k_h(\cdot\mid s)\rangle$.
        end for
    end for
    # Policy Improvement
    for all $(h,s,a) \in [H] \times S \times A$ do
        $\pi^{k+1}_h(a\mid s) = \pi^k_h(a\mid s)\exp(\beta \hat Q^{\pi}_h(s,a)) \big/ \sum_{a'}\pi^k_h(a'\mid s)\exp(\beta \hat Q^{\pi}_h(s,a'))$.
    end for
    Update the empirical estimates $\hat r, \hat P$ with Equation (2).
end for

Figure 1: Example of the Gridworld environment.

Figure 2: Cumulative rewards obtained by robust and non-robust policy optimization on robust transition with different level of uncertainty ρ = 0.1, 0.2, 0.3 under ℓ1 distance, (s, a)-rectangular set.

