HORIZON-FREE REINFORCEMENT LEARNING FOR LATENT MARKOV DECISION PROCESSES

Abstract

We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\tilde{O}(\sqrt{M\Gamma SAK})$ regret bound, where $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\Gamma \le S$ is the maximum transition degree of any state-action pair. The regret bound scales only logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. Key to our proof is an analysis of the total variance of alpha vectors, which we carefully bound via a recursion-based technique. We complement our positive result with a novel $\Omega(\sqrt{MSAK})$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant. Our lower bound relies on new constructions of hard instances and an argument based on the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proofs for MDPs and thus may be of independent interest.

¹Their original bound is $\tilde{O}(\sqrt{MS^2AH^3K})$ under the scaling that the reward from each step is bounded by 1. We rescale the reward to be bounded by $1/H$ so that the total reward from each episode is bounded by 1, which is the setting we consider.

1. INTRODUCTION

One of the most popular models for Reinforcement Learning (RL) is the Markov Decision Process (MDP), in which transitions and rewards depend only on the current state and the agent's action. In standard MDPs, the agent fully observes the state, so the optimal policy also depends only on the state (a history-independent policy). A long line of research on MDPs has established minimax regret and sample complexity guarantees. Another popular model is the Partially Observable MDP (POMDP), in which the agent only partially observes the state. Even though the underlying transition is still Markovian, the sample complexity lower bound for POMDPs is exponential in the sizes of the state and action spaces, in part because optimal policies for POMDPs are history-dependent. In this paper we focus on a middle ground between MDPs and POMDPs, namely the Latent MDP (LMDP). An LMDP can be viewed as a collection of MDPs sharing the same state and action spaces, while the transitions and rewards may vary across them. One of the MDPs is sampled with some probability at the beginning of each episode and does not change during the episode. The agent needs to find a policy that works well on these MDPs in an average sense. Empirically, LMDPs model a wide variety of applications (Yu et al., 2020; Iakovleva et al., 2020; Finn et al., 2018; Ramamoorthy et al., 2013; Doshi-Velez & Konidaris, 2016; Yao et al., 2018). In general, no single policy is simultaneously optimal on every MDP, so this task is strictly harder than learning a single MDP. On the other hand, an LMDP is a special case of a POMDP, because the unobserved part of the state (the identity of the sampled MDP) is static within each episode, and the observable part is just the state of that MDP. Unfortunately, generic LMDPs admit an exponential sample complexity lower bound (Kwon et al., 2021), so additional assumptions are needed to make the problem tractable.
In this paper, we consider the setting where, after each episode ends, the agent receives the context indicating which MDP it played. Such information is often available. For example, in a maze navigation task, the location of the goal state can be viewed as the context. In this setting, Kwon et al. (2021) obtained an $\tilde{O}(\sqrt{MS^2AHK})$ regret upper bound, where $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the number of episodes. They did not study the regret lower bound.¹ To benchmark this result, the only available bound is the $\tilde{\Theta}(\sqrt{SAK})$ bound for standard MDPs, obtained by viewing an MDP as a special case of an LMDP. Comparing these two bounds, we find significant gaps: ① Is the dependency on $M$ necessary for LMDPs? ② The bound for MDPs is (nearly) horizon-free (no polynomial dependency on $H$); is the polynomial dependency on $H$ necessary for LMDPs? ③ The dependency on the number of states is $\sqrt{S}$ for MDPs, but the bound of Kwon et al. (2021) for LMDPs scales with $S$. In this paper, we resolve the first two questions and partially answer the third.
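The rescaling described in footnote 1 can be checked directly: dividing every reward (and hence the regret) by $H$ transforms the original bound of Kwon et al. (2021) as

```latex
\frac{1}{H}\cdot\tilde{O}\!\left(\sqrt{MS^{2}AH^{3}K}\right)
  = \tilde{O}\!\left(\sqrt{\frac{MS^{2}AH^{3}K}{H^{2}}}\right)
  = \tilde{O}\!\left(\sqrt{MS^{2}AHK}\right),
```

which is the form quoted above for the total-reward-bounded-by-1 setting.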

1.1. MAIN CONTRIBUTIONS AND TECHNICAL NOVELTIES

We obtain the following new results:

‚ Near-optimal regret guarantee for LMDPs. We present an algorithmic framework for LMDPs with context in hindsight, which can be instantiated with a plug-in solver for planning problems. We consider two types of solvers, one model-optimistic and one value-optimistic, and prove a regret bound of $\tilde{O}(\sqrt{M\Gamma SAK})$, where $\Gamma \le S$ is the maximum transition degree of any state-action pair. Compared with the result in Kwon et al. (2021), ours only requires the total reward to be bounded, whereas they required a bounded reward at each step. Furthermore, we improve the $H$-dependence from $\sqrt{H}$ to logarithmic, making our bound (nearly) horizon-free. Lastly, our bound scales with $\sqrt{S\Gamma}$, which is strictly better than the $S$ in their bound. The main technique of our model-optimistic algorithm is a Bernstein-type confidence set on each entry of the transition dynamics, which leads to a small Bellman error. The main difference between our value-optimistic algorithm and that of Kwon et al. (2021) is that we use a bonus depending on the variance of next-step values, derived from Bennett's inequality instead of Bernstein's inequality. This helps propagate optimism from the last step back to the first step, avoiding the $H$-dependency. We analyse the two solvers in a unified way, as their Bellman errors are of the same order.

‚ New regret lower bound for LMDPs. We obtain a novel $\Omega(\sqrt{MSAK})$ regret lower bound for LMDPs, which shows that the dependency on $M$ is necessary. Notably, the lower bound also implies that the $\tilde{O}(\sqrt{M\Gamma SAK})$ upper bound is optimal up to a $\sqrt{\Gamma}$ factor. Furthermore, our lower bound holds even for $\Gamma = 2$, which shows that our upper bound is minimax optimal for the class of LMDPs with $\Gamma = O(1)$. Our proof relies on new constructions of hard instances, different from existing ones for MDPs (Domingues et al., 2021). In particular, we use a two-phase structure to construct hard instances (cf. Figure 1).
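The shape of a variance-dependent exploration bonus of the kind described above can be sketched as follows. This is a simplified illustration, not the paper's exact bonus: the constants, the failure probability `delta`, and the function name are placeholders; the point is that the leading term scales with $\sqrt{\mathrm{Var}/n}$ rather than with the value range, which is what avoids a polynomial horizon dependence when the total reward is bounded by 1.

```python
import math

def variance_bonus(next_values, p_hat, n, delta=0.05):
    """Bennett/Bernstein-style bonus for one (s, a) pair (illustrative).

    next_values: estimated next-step values V(s') over successor states
    p_hat:       empirical transition probabilities over successors
    n:           number of visits to (s, a)

    The sqrt(variance / n) leading term replaces a crude range-based
    (Hoeffding-type) term; with total reward bounded by 1, the variance
    can be small even when the horizon H is large.
    """
    if n == 0:
        return 1.0  # trivial bonus: total reward is bounded by 1
    log_term = math.log(2.0 / delta)
    mean = sum(p * v for p, v in zip(p_hat, next_values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(p_hat, next_values))
    # variance term dominates; the range term is lower order (1/n)
    return min(1.0, math.sqrt(2.0 * var * log_term / n) + log_term / n)
```

When the empirical next-step values are nearly constant, the bonus shrinks at rate $1/n$; only genuinely uncertain transitions pay the slower $1/\sqrt{n}$ rate, which is the mechanism behind propagating optimism without accumulating an $H$ factor.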
Furthermore, previous approaches for proving lower bounds for MDPs do not work for LMDPs. For example, in the MDP instance of Domingues et al. (2021), the randomness comes from the algorithm and from the last transition step before entering the good state or bad state. In an LMDP, the randomness of sampling one MDP from the multiple MDPs must also be considered. Such randomness not only dilutes the value function by averaging over the MDPs, but also divides the pushforward measure (see Page 3 of Domingues et al. (2021)) into $M$ parts. As a result, the $M$ terms in the KL divergence in Equation (2) of Domingues et al. (2021) and that in Equation (10) cancel out, and the final lower bound does not contain $M$. To overcome this, we adopt the symmetrization technique from theoretical computer science. This novel technique helps generalize bounds from a single-party result to a multi-party result, which may give rise to a tighter lower bound.

