LEARNING ROBUST STATE ABSTRACTIONS FOR HIDDEN-PARAMETER BLOCK MDPS

Abstract

Many control tasks exhibit similar dynamics that can be modeled as having common latent structure. Hidden-Parameter Markov Decision Processes (HiP-MDPs) explicitly model this structure to improve sample efficiency in multi-task settings. However, this setting makes strong assumptions on the observability of the state that limit its application in real-world scenarios with rich observation spaces. In this work, we leverage ideas of common structure from the HiP-MDP setting and extend them to enable robust state abstractions inspired by Block MDPs. We derive instantiations of this new framework for both multi-task reinforcement learning (MTRL) and meta-reinforcement learning (Meta-RL) settings. Further, we provide transfer and generalization bounds based on task and state similarity, along with sample complexity bounds that depend on the aggregate number of samples across tasks rather than on the number of tasks, a significant improvement over prior work that uses the same environment assumptions. To further demonstrate the efficacy of the proposed method, we empirically compare against multi-task and meta-reinforcement learning baselines and show improvements over them.

1. INTRODUCTION

A key open challenge in AI research is how to train agents that learn behaviors that generalize across tasks and environments. When there is common structure underlying the tasks, multi-task reinforcement learning (MTRL), where the agent learns a set of tasks simultaneously, has definite advantages in robustness and sample efficiency over the single-task setting, where the agent learns each task independently. There are two ways in which learning multiple tasks can accelerate learning: the agent can learn a common representation of observations, and the agent can learn a common way to behave. Prior work in MTRL has leveraged this idea by sharing representations across tasks (D'Eramo et al., 2020) or providing per-task sample complexity results that show improved sample efficiency from transfer (Brunskill & Li, 2013). However, explicit exploitation of the shared structure across tasks via a unified dynamics model has been lacking. Prior works that make use of shared representations take a naive unification approach that posits all tasks lie in a shared domain (Figure 1, left). In the single-task setting, on the other hand, research on state abstractions has a much richer history, with several works on improved generalization through the aggregation of behaviorally similar states (Ferns et al., 2004; Li et al., 2006; Luo et al., 2019; Zhang et al., 2020b). In this work, we propose to leverage rich state abstraction models from the single-task setting and explore their potential for the more general multi-task setting. We frame the problem as a structured super-MDP with a shared state space and a universal dynamics model conditioned on a task-specific hidden parameter (Figure 1, right).
This additional structure gives us better sample efficiency, both theoretically, compared to related bounds (Brunskill & Li, 2013; Tirinzoni et al., 2020), and empirically, against relevant baselines (Yu et al., 2020; Rakelly et al., 2019; Chen et al., 2018; Teh et al., 2017). We learn a latent representation with smoothness properties for better few-shot generalization to unseen tasks within this family. This allows us to derive new value loss bounds and sample complexity bounds that depend on how far a new task is from the ones already seen. We focus on multi-task settings where dynamics can vary across tasks but the reward function is shared. We show that this setting can be formalized as a hidden-parameter MDP (HiP-MDP) (Doshi-Velez & Konidaris, 2013), where the changes in dynamics are captured by a latent variable, unifying the dynamics across tasks as a single global function. This setting assumes a global latent structure over all tasks (or MDPs). Many real-world scenarios fall under this framework, such as autonomous driving under different weather and road conditions, or even in different vehicles, all of which change the dynamics of driving. Another example is warehouse robots, where the same tasks are performed under different conditions and warehouse layouts. The setting also applies to some cases of RL for medical treatment optimization, where different patient groups respond differently to treatment, yet the desired outcome is the same. With this assumed structure, we can provide concrete zero-shot generalization bounds for unseen tasks within this family. Further, we explore the setting where the state space is latent and we have access only to high-dimensional observations, and we show how to recover robust state abstractions in this setting. This is, again, a highly realistic setting in robotics, where we do not always have an amenable, Lipschitz, low-dimensional state space.
Cameras are a convenient and inexpensive way to acquire state information, and handling pixel observations is key to approaching these problems. A block MDP (Du et al., 2019) provides a concrete way to formalize this observation-based setting. Leveraging this property of the block MDP framework, in combination with the unified dynamical structure assumed by HiP-MDPs, we introduce the hidden-parameter block MDP (HiP-BMDP) to handle settings with high-dimensional observations and structured, changing dynamics. The key contributions of this work are: a new viewpoint of the multi-task setting with a shared reward function as a universal MDP under the HiP-BMDP framing, which naturally leads to a gradient-based representation learning algorithm; theoretical generalization results that incorporate a learned state representation; and empirical results showing that our method outperforms other multi-task and meta-learning baselines in both fast-adaptation and zero-shot transfer settings.

2. BACKGROUND

In this section, we introduce the base environment as well as notation and additional assumptions about the latent structure of the environments and the multi-task setup considered in this work.

A finite¹, discrete-time Markov Decision Process (MDP) (Bellman, 1957; Puterman, 1995) is a tuple ⟨S, A, R, T, γ⟩, where S is the set of states, A is the set of actions, R : S × A → ℝ is the reward function, T : S × A → Dist(S) is the environment transition probability function, and γ ∈ [0, 1) is the discount factor. At each time step, the learning agent perceives a state s_t ∈ S, takes an action a_t ∈ A drawn from a policy π : S × A → [0, 1], and with probability T(s_{t+1} | s_t, a_t) enters the next state s_{t+1}, receiving a numerical reward R_{t+1} from the environment. The value function of policy π is defined as V^π(s) = 𝔼_π[ ∑_{t=0}^∞ γ^t R_{t+1} | S_0 = s ]. The optimal value function V* is the maximum value function over the class of stationary policies.

Hidden-Parameter MDPs (HiP-MDPs) (Doshi-Velez & Konidaris, 2013) are defined by a tuple M: ⟨S, A, Θ, T_θ, R, γ, P_Θ⟩, where S is a finite state space, A a finite action space, T_θ describes the transition distribution for a specific task described by task parameter θ ∼ P_Θ, R is the reward function, γ is the discount factor, and P_Θ is the distribution over task parameters. This defines a family of MDPs, where each MDP is described by a parameter θ ∼ P_Θ. We assume that this parameter θ is fixed for an episode and indicated by an environment id given at the start of the episode.

Block MDPs (Du et al., 2019) are described by a tuple ⟨S, A, X, p, q, R⟩ with an unobservable state space S, action space A, and observable space X. Here p denotes the latent transition distribution p(s′ | s, a) for s, s′ ∈ S, a ∈ A; q is the (possibly stochastic) emission mapping that emits the observations q(x | s) for x ∈ X, s ∈ S; and R is the reward function. We are interested in the setting where this
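To make these definitions concrete, the sketch below builds a hypothetical toy tabular MDP (3 states, 2 actions, randomly generated T and R, none of which come from the paper) and computes V* by value iteration, i.e., by repeatedly applying the Bellman optimality operator:

```python
import numpy as np

# Hypothetical toy MDP: T has shape (|S|, |A|, |S|), R has shape (|S|, |A|).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T(s'|s,a)
R = rng.uniform(size=(n_states, n_actions))                       # R(s,a)

def value_iteration(T, R, gamma, tol=1e-8):
    """Compute V* by iterating the Bellman optimality operator to a fixed point."""
    V = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * T @ V   # Q(s,a) = R(s,a) + γ Σ_{s'} T(s'|s,a) V(s')
        V_new = Q.max(axis=1)   # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_star = value_iteration(T, R, gamma)
```

Since rewards here lie in [0, 1], the computed values are bounded by 1/(1 − γ), and V_star satisfies the Bellman optimality equation up to the stopping tolerance.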



¹ We use this assumption only for theoretical results, but our method can be applied to continuous domains.
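The HiP-MDP definition above, a single global transition function conditioned on a per-episode task parameter θ ∼ P_Θ, can be sketched as follows. The construction (a convex mixture of two fixed base kernels indexed by θ ∈ [0, 1]) and the helper names are illustrative assumptions for this toy example, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2
# Two fixed base transition kernels; the hidden parameter θ interpolates them.
T0 = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
T1 = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def transition(theta):
    """Global dynamics family T_θ(s'|s,a): a convex mixture of the base kernels."""
    return (1.0 - theta) * T0 + theta * T1

def sample_episode(theta, policy, horizon=10):
    """Roll out one episode with θ held fixed throughout, as the HiP-MDP assumes."""
    T = transition(theta)
    s, traj = 0, []
    for _ in range(horizon):
        a = policy(s)
        s_next = rng.choice(n_states, p=T[s, a])
        traj.append((s, a, s_next))
        s = s_next
    return traj

theta = rng.uniform()  # θ ~ P_Θ, drawn once at the start of the episode
traj = sample_episode(theta, policy=lambda s: s % n_actions)
```

Each θ indexes one member of the family of MDPs; because the mixture is smooth in θ, nearby parameters yield nearby transition kernels, which is the kind of smoothness the generalization bounds in this work rely on.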



Figure 1: Visualizations of the typical MTRL setting (left) and the HiP-MDP setting (right).

