HORIZON-FREE REINFORCEMENT LEARNING FOR LATENT MARKOV DECISION PROCESSES

Abstract

We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\tilde{O}(\sqrt{M \Gamma S A K})$ regret bound, where $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\Gamma \le S$ is the maximum transition degree of any state-action pair. The regret bound scales only logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. Key to our proof is an analysis of the total variance of alpha vectors, which is carefully bounded by a recursion-based technique. We complement our positive result with a novel $\Omega(\sqrt{M S A K})$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant. Our lower bound relies on new constructions of hard instances and an argument based on the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proofs for MDPs, and may thus be of independent interest.

1. INTRODUCTION

One of the most popular models for reinforcement learning (RL) is the Markov Decision Process (MDP), in which transitions and rewards depend only on the current state and the agent's action. In standard MDPs, the agent fully observes the state, so the optimal policy also depends only on the state (a history-independent policy). A long line of research on MDPs has established minimax regret and sample-complexity guarantees. Another popular model is the Partially Observable MDP (POMDP), in which the agent receives only partial observations of the state. Even though the underlying transitions are still Markovian, the sample complexity of learning POMDPs has a lower bound that is exponential in the state and action sizes, in part because optimal policies for POMDPs are history-dependent. In this paper we focus on a middle ground between MDPs and POMDPs, namely the Latent MDP (LMDP). An LMDP can be viewed as a collection of MDPs sharing the same state and action spaces, while transitions and rewards may vary across them. At the beginning of each episode, one of the MDPs is sampled according to a fixed probability distribution and remains fixed throughout the episode. The agent needs to find a policy that performs well on these MDPs in an average sense. Empirically, LMDPs can model a wide variety of applications (Yu et al., 2020; Iakovleva et al., 2020; Finn et al., 2018; Ramamoorthy et al., 2013; Doshi-Velez & Konidaris, 2016; Yao et al., 2018). In general, no single policy is optimal for every MDP simultaneously, so this task is strictly harder than learning a single MDP. On the other hand, an LMDP is a special case of a POMDP: the unobserved component (the identity of the sampled MDP) is static within each episode, and the observable state is just the state of the sampled MDP. Unfortunately, for generic LMDPs there exists an exponential sample-complexity lower bound (Kwon et al., 2021), so additional assumptions are needed to make the problem tractable.
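The generative process described above can be made concrete with a small simulator. The following is a minimal sketch, not the paper's formalism: the class name, tensor layout, and uniform-reward parameterization are our own illustrative choices. It captures the defining feature of an LMDP, namely that a latent MDP index is drawn once per episode and held fixed while the agent observes only states and rewards.

```python
import numpy as np

class LatentMDP:
    """Illustrative LMDP simulator: M tabular MDPs over shared state/action
    spaces, with a latent index m drawn once per episode from weights w."""

    def __init__(self, P, R, w, H, seed=0):
        # P: (M, S, A, S) transition tensors; R: (M, S, A) mean rewards
        # w: (M,) mixing weights summing to 1; H: planning horizon
        self.P, self.R, self.w, self.H = P, R, w, H
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Sample the latent MDP; it stays fixed for the whole episode.
        self.m = self.rng.choice(len(self.w), p=self.w)
        self.s, self.t = 0, 0  # start state and step counter
        return self.s

    def step(self, a):
        m, s = self.m, self.s
        r = self.R[m, s, a]
        # Next state drawn from the latent MDP's transition row.
        self.s = int(self.rng.choice(self.P.shape[-1], p=self.P[m, s, a]))
        self.t += 1
        return self.s, r, self.t >= self.H
```

Note that the agent interacting through `reset`/`step` never sees `self.m` during the episode, which is exactly what distinguishes an LMDP from running a known MDP.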
In this paper, we consider the setting in which, after each episode ends, the agent observes the context indicating which MDP it interacted with. Such information is often available in practice. For example, in a maze navigation task, the location of the goal state can be viewed as the context.
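The hindsight-context protocol can be summarized as a simple interaction loop. The sketch below is our own illustration, not the paper's algorithm; it assumes a duck-typed environment exposing `P` (an `(M, S, A, S)` tensor), `reset()`, `step(a)`, and the realized latent index `m`. The point it demonstrates is that revealing the context after the episode lets the learner attribute every observed transition to the correct latent MDP and maintain separate counts per context.

```python
import numpy as np

def run_with_hindsight(env, policy, K):
    """Play K episodes; after each one, use the revealed context m to
    update per-context transition counts N[m]. Returns the count tensor."""
    M, S, A, _ = env.P.shape
    N = np.zeros((M, S, A, S))           # per-context transition counts
    for _ in range(K):
        s, done, traj = env.reset(), False, []
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            traj.append((s, a, s2))      # context unknown mid-episode
            s = s2
        m = env.m                        # context revealed in hindsight
        for (s, a, s2) in traj:
            N[m, s, a, s2] += 1          # credit the correct latent MDP
    return N
```

Without the hindsight context, the counts from different latent MDPs would be pooled together, which is precisely why the generic (context-free) problem is so much harder.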

