EFFICIENT REINFORCEMENT LEARNING IN FACTORED MDPS WITH APPLICATION TO CONSTRAINED RL

Abstract

Reinforcement learning (RL) in episodic, factored Markov decision processes (FMDPs) is studied. We propose an algorithm called FMDP-BF, whose regret is exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and which improves on the previous FMDP result of Osband & Van Roy (2014b) by a factor of nH|S_i|, where |S_i| is the cardinality of the i-th factored state subspace, H is the planning horizon, and n is the number of factored transitions. We also provide a lower bound, which shows the near-optimality of our algorithm with respect to the number of timesteps T, the horizon H, and the factored state-action subspace cardinality. Finally, as an application, we study a new formulation of constrained RL, RL with knapsack constraints (RLwK), and provide the first sample-efficient algorithm for it, based on FMDP-BF.

1. INTRODUCTION

Reinforcement learning (RL) is concerned with sequential decision-making problems in which an agent interacts with a stochastic environment and aims to maximize its cumulative reward. The environment is usually modeled as a Markov decision process (MDP) whose transition kernel and reward function are unknown to the agent. A central challenge for the agent is efficient exploration in the MDP, so as to minimize its regret or the related sample complexity of exploration. The tabular case, in which almost no prior knowledge about the MDP dynamics is assumed, has been studied extensively. The regret or sample-complexity bounds typically depend polynomially on the cardinalities of the state and action spaces (e.g., Strehl et al., 2009; Jaksch et al., 2010; Azar et al., 2017; Dann et al., 2017; Jin et al., 2018; Dong et al., 2019; Zanette & Brunskill, 2019). Moreover, matching lower bounds (e.g., Jaksch et al., 2010) imply that these results cannot be improved without additional assumptions. On the other hand, many RL tasks involve large state and action spaces, for which these regret bounds are still excessively large.

In many practical scenarios, one can take advantage of specific structure in the MDP to develop more efficient algorithms. For example, in robotics, the state may be high-dimensional, but each subspace of the state may evolve independently of the others and depend only on a low-dimensional subspace of the previous state. Formally, such problems can be described as factored MDPs (Boutilier et al., 2000; Kearns & Koller, 1999; Guestrin et al., 2003). Most relevant to the present work is Osband & Van Roy (2014b), who proposed a posterior sampling algorithm and a UCRL-like algorithm that both enjoy √T regret, where T is the total number of timesteps. Their regret bounds have a linear dependence on the time horizon and on the cardinality of each factored state subspace, and it is unclear whether these bounds are tight.

In this work, we tackle this problem by proposing algorithms with improved regret bounds and by developing corresponding lower bounds for episodic FMDPs. We propose a sample- and computation-efficient algorithm called FMDP-BF based on the principle of optimism in the face of uncertainty, and prove its regret bounds. We also provide a lower bound, which implies that our algorithm is near-optimal with respect to the number of timesteps T, the planning horizon H, and the factored state-action subspace cardinality |X[Z_i]|. As an application, we study a novel formulation of constrained RL, RL with knapsack constraints (RLwK), which we believe naturally captures many real-life scenarios. Applying FMDP-BF to this setting, we obtain a statistically efficient algorithm with a regret bound that is near-optimal in terms of T, S, A, and H.

Our contributions are summarized as follows:

1. We propose an algorithm for FMDPs and prove a regret bound that improves on the previous result of Osband & Van Roy (2014b) by a factor of nH|S_i|.
2. We prove a regret lower bound for FMDPs, which implies that our regret bound is near-optimal in terms of the number of timesteps T, the horizon H, and the factored state-action subspace cardinality.
3. We apply FMDP-BF to RLwK, a novel constrained RL setting with knapsack constraints, and prove a regret bound that is near-optimal in terms of T, S, A, and H.

2. PRELIMINARIES

We consider the setting of a tabular episodic Markov decision process (MDP) (S, A, H, P, R), where S is the state set, A is the action set, and H is the number of steps in each episode. P is the transition kernel, so that P(·|s, a) gives the distribution over next states when action a is taken in state s, and R(s, a) is the reward distribution, supported on [0, 1], of taking action a in state s. With a slight abuse of notation, we also use R(s, a) to denote the expectation E[R(s, a)]. In each episode, the agent starts from an initial state s_1 that may be arbitrarily selected. At each step h ∈ [H], the agent observes the current state s_h ∈ S, takes action a_h ∈ A, receives a reward r_h sampled from R(s_h, a_h), and transits to state s_{h+1} with probability P(s_{h+1}|s_h, a_h). The episode ends when s_{H+1} is reached.

A policy π is a collection of H policy functions {π_h : S → A}_{h ∈ [H]}. We use V^π_h : S → ℝ to denote the value function at step h under policy π, which gives the expected sum of remaining rewards received under policy π starting from s_h = s:

$$V^\pi_h(s) = \mathbb{E}\left[\sum_{h'=h}^{H} R\big(s_{h'}, \pi_{h'}(s_{h'})\big) \,\middle|\, s_h = s\right].$$

Accordingly, we define the Q-value function Q^π_h(s, a) at step h as

$$Q^\pi_h(s, a) = \mathbb{E}\left[R(s_h, a_h) + \sum_{h'=h+1}^{H} R\big(s_{h'}, \pi_{h'}(s_{h'})\big) \,\middle|\, s_h = s,\, a_h = a\right].$$

We use V^*_h and Q^*_h to denote the optimal value and Q-functions at step h, attained under the optimal policy π^*. The agent interacts with the environment for K episodes, with policy π_k = {π_{k,h} : S → A}_{h ∈ [H]} determined before the k-th episode begins. The agent's goal is to maximize its cumulative reward ∑_{k=1}^{K} ∑_{h=1}^{H} r_{k,h} over T = KH steps, or equivalently, to minimize the expected regret

$$\mathrm{Reg}(K) \;\overset{\mathrm{def}}{=}\; \sum_{k=1}^{K} \left[V^*_1(s_{k,1}) - V^{\pi_k}_1(s_{k,1})\right],$$

where s_{k,1} is the initial state of episode k.
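When the transition kernel and expected rewards are known, the optimal value functions V^*_h and Q^*_h defined above can be computed exactly by backward induction over h = H, ..., 1. The following sketch is purely illustrative of these definitions (it is not the paper's FMDP-BF algorithm, which must learn P and R from data); the function name and array layout are our own choices.

```python
import numpy as np

def backward_induction(P, R, H):
    """Compute V*_h and Q*_h for a known tabular episodic MDP.

    P: array of shape (S, A, S); P[s, a, s'] = P(s' | s, a).
    R: array of shape (S, A); expected rewards in [0, 1].
    H: number of steps per episode.
    Returns (V, Q) with V of shape (H + 1, S) and Q of shape (H, S, A).
    """
    S, A = R.shape
    V = np.zeros((H + 1, S))  # V[H] = 0: no reward after the episode ends
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        # Q*_h(s, a) = R(s, a) + sum_{s'} P(s' | s, a) V*_{h+1}(s')
        Q[h] = R + P @ V[h + 1]
        # V*_h(s) = max_a Q*_h(s, a); the greedy policy attains it
        V[h] = Q[h].max(axis=1)
    return V, Q
```

The greedy policy π^*_h(s) = argmax_a Q[h][s, a] recovered from the Q arrays is optimal for the episodic objective above.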

2.1. FACTORED MDPS

A factored MDP is an MDP whose rewards and transitions exhibit certain conditional independence structure. We start with the formal definition of a factored MDP (Boutilier et al., 2000; Osband & Van Roy, 2014b; Xu & Tewari, 2020; Lu & Van Roy, 2019). Let P(X, Y) denote the set of functions mapping each x ∈ X to a probability distribution over Y.
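The conditional independence structure can be made concrete as follows: when each factor of the next state depends only on a small scope Z_i of the current state-action variables, the joint transition probability factorizes into a product over factors. The sketch below illustrates this factorization under our own illustrative data layout (all names are hypothetical, not notation from the paper).

```python
def factored_transition_prob(next_state, state_action, scopes, factor_probs):
    """P(next_state | state_action) = prod_i P_i(next_state[i] | state_action[Z_i]).

    next_state:   tuple of factor values (x_1, ..., x_n).
    state_action: tuple of state-action variable values.
    scopes:       scopes[i] is the tuple of indices Z_i for factor i.
    factor_probs: factor_probs[i] maps the restricted tuple
                  state_action[Z_i] to a dict over values of factor i.
    """
    p = 1.0
    for i, scope in enumerate(scopes):
        # Restrict the state-action vector to the scope Z_i of factor i ...
        restricted = tuple(state_action[j] for j in scope)
        # ... and multiply in that factor's conditional probability.
        p *= factor_probs[i][restricted][next_state[i]]
    return p
```

The statistical benefit is visible here: each conditional table P_i has only |X[Z_i]| rows to estimate, rather than one row per joint state-action pair, which is what allows regret bounds to scale with the factored subspace cardinalities instead of |S||A|.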

