UNRAVEL STRUCTURED HETEROGENEITY OF TASKS IN META-REINFORCEMENT LEARNING VIA EXPLORATORY CLUSTERING

Abstract

Meta-reinforcement learning (meta-RL) is developed to quickly solve new tasks by leveraging knowledge from prior tasks. Previous studies typically assume that tasks are drawn i.i.d. from a single distribution, which ignores possible structured heterogeneity among tasks. The non-transferable knowledge induced by such structured heterogeneity hinders fast adaptation to new tasks. In this paper, we formulate the structured heterogeneity of tasks via clustering, such that transferable knowledge can be inferred within each cluster and non-transferable knowledge is excluded across clusters. To this end, we develop a dedicated exploratory policy that discovers task clusters by reducing uncertainty in posterior inference. Within the identified clusters, the exploitation policy can solve related tasks by utilizing knowledge shared within the cluster. Experiments on various MuJoCo tasks show that the proposed method can effectively unravel cluster structures in both rewards and state dynamics, demonstrating clear advantages over a set of state-of-the-art baselines.

1. INTRODUCTION

Conventional reinforcement learning (RL) is notorious for sample inefficiency, often requiring millions of interactions with the environment to learn a well-performing policy for a new task. Inspired by the human learning process, meta-reinforcement learning (meta-RL) is proposed to quickly learn new tasks by leveraging knowledge shared by related tasks (Finn et al., 2017; Duan et al., 2016; Wang et al., 2016). Extensive efforts have been put into learning and utilizing transferable knowledge in meta-RL. For example, Finn et al. (2017) proposed to learn a set of shared meta parameters used to initialize the local policy when a new task arrives. Duan et al. (2016) and Wang et al. (2016) trained an RNN encoder to characterize prior tasks according to the interaction history. However, little attention has been paid to situations where some knowledge is only locally transferable among tasks. All the aforementioned methods implicitly assume tasks have substantially shared structures, and thus that knowledge can be broadly shared across all tasks. However, heterogeneity among tasks exists in practice, if not prevails, and hampers the effectiveness of existing meta-RL algorithms. For example, the skills necessary for the game of Go can hardly be applied to Gomoku, though both are played on the same board. We formulate this scenario as a more complicated but also more realistic meta-RL setting where tasks originate from different distributions, i.e., tasks are clustered. As a result, some knowledge is locally transferable within clusters but cannot be shared globally. We refer to this as structured heterogeneity among RL tasks, and explicitly model cluster structures in the task distribution to capture cluster-level knowledge¹.
Structured heterogeneity has been studied in supervised meta-learning (Yao et al., 2019); but it is far more challenging to handle in meta-RL, where the key bottleneck is how to unravel the clustering structure in a population of RL tasks. This can be decomposed into two key research questions, namely population-level structure estimation and task-level inference. Different from supervised learning tasks, where static task-specific data is provided before meta-learning, the observations in RL tasks are collected through the agent's interactions with the environment. As a result, in addition to the original explore-exploit trade-off an RL agent handles for return maximization, a meta-RL agent must also balance the trade-off between clustering-structure identification and return maximization when taking actions. Oftentimes, the action that benefits return maximization does not necessarily help clustering-structure identification. The situation becomes more pressing when solving new tasks at test time, where the agent needs to differentiate what is transferable at the cluster level, at the task level, or not at all, for fast adaptation. To handle the structured heterogeneity of tasks in meta-RL, we propose a cluster-based meta-RL algorithm, called MILET: Meta reInforcement Learning via Exploratory clusTering, which is designed to explore the clustering structure of tasks and enhance fast adaptation on new tasks with cluster-level transferable knowledge. Specifically, we perform cluster-based posterior inference (Rao et al., 2019; Dilokthanakul et al., 2016) to infer the cluster assignment of a new task. To accelerate cluster inference, we learn a dedicated exploration policy that is aware of cluster structures and is trained to explore the environment so as to reduce uncertainty in cluster inference as quickly as possible. Furthermore, we design a composed reward function for the exploration policy to encourage a coarse-to-fine exploratory strategy.
New tasks can then be quickly adapted to with the explored cluster information, by using locally transferable knowledge within clusters. We compare our method with a rich set of state-of-the-art meta-RL baselines on various MuJoCo environments (Todorov et al., 2012) with cluster structures in both rewards and state dynamics. We also show our method can mitigate the sparse-reward issue through sample-efficient exploration of cluster structures, which provides hints for solving a specific task. We further test our method on environments without explicit clustering structures; the results show our method can automatically discover locally transferable knowledge and benefit adaptation.
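The two ingredients described above, cluster-based posterior inference over a task's cluster assignment and an exploration reward that favors actions reducing uncertainty in that inference, can be illustrated with a deliberately simplified sketch. Everything here is an assumption for illustration: `log_lik_fn` stands in for learned cluster-conditioned reward/dynamics models, and the exact Bayesian update below is a non-amortized analogue of the paper's posterior inference, not its actual implementation.

```python
import numpy as np

def cluster_posterior(log_prior, traj, log_lik_fn):
    """Posterior over K cluster assignments given an ongoing trajectory.

    log_prior  : (K,) array of log prior probabilities over clusters.
    traj       : iterable of (s, a, r, s2) transitions observed so far.
    log_lik_fn : hypothetical model giving log p(r, s2 | s, a, cluster=k);
                 in practice this would be a learned, cluster-conditioned
                 reward/dynamics model.
    """
    log_post = np.array(log_prior, dtype=float)
    for (s, a, r, s2) in traj:
        log_post += np.array(
            [log_lik_fn(s, a, r, s2, k) for k in range(len(log_post))]
        )
    log_post -= np.max(log_post)  # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

def entropy(p):
    """Shannon entropy of a discrete distribution (nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def exploration_reward(post_before, post_after):
    """Intrinsic reward: uncertainty reduction in the cluster posterior."""
    return entropy(post_before) - entropy(post_after)
```

As a usage example, with two clusters whose reward models are unit-variance Gaussians centered at 0 and 1, observing rewards near 1 concentrates the posterior on the second cluster, and each informative transition yields a positive `exploration_reward`; transitions that are equally likely under all clusters yield no intrinsic reward, which is exactly why a dedicated exploratory policy is needed.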

2. RELATED WORK

Task modeling in meta-learning. Task modeling is important for realizing fast adaptation to new tasks in meta-learning. Finn et al. (2017) first proposed model-agnostic meta-learning (MAML), which aims to learn a shared model initialization, i.e., the meta model, from a population of tasks. MAML does not explicitly model tasks, but it expects the meta model to be only a few gradient-update steps away from every task. Later, an array of methods extended MAML by explicitly modeling tasks using the given training data in supervised meta-learning. Lee & Choi (2018) learned a task-specific subspace of each layer's activations, on which gradient-based adaptation is performed. Vuorio et al. (2019) explicitly learned task embeddings from each task's data points and used them to generate a task-specific meta model. Yao et al. (2019) adopted a hierarchical task clustering structure, which enables cluster-specific meta models. Such a design encourages the solution to capture locally transferable knowledge inside each cluster, similar to our MILET model. However, task information is not explicit in meta-RL: since the true reward and state-transition functions are not accessible to the agent, the agent needs to interact with the environment to collect observations about the task while maximizing the return from those interactions. MILET performs posterior inference of a task's cluster assignment based on its ongoing trajectory; better yet, it is designed to behave exploratorily so as to quickly identify the tasks' clustering structure.

Exploration in meta-reinforcement learning. Exploration plays an important role in meta-RL, as the agent can only learn from its interactions with the environment. In gradient-based meta-RL (Finn et al., 2017), the local policy is trained on the trajectories collected by the meta policy, so exploration for task structure is not explicitly handled. Stadie et al. (2018) and Rothfuss et al. (2019) computed gradients w.r.t. the sampling distribution of the meta policy, in addition to the collected trajectories. Gurumurthy et al. (2020) learned a separate exploration policy for MAML to collect informative pre-update trajectories. Gupta et al. (2018) also extended MAML by using learnable latent variables to control different exploration behaviors. Context-based meta-RL algorithms (Duan et al., 2016; Wang et al., 2016) automatically learn to trade off exploration and exploitation by learning a policy conditioned on the current context. Zintgraf et al. (2020) explicitly provided task uncertainty to the policy to facilitate exploration via variational inference. Zhou et al. (2019) introduced intrinsic rewards encouraging exploration that improves the prediction of dynamics. Zhang et al. (2021) and Liu et al. (2021) developed separate exploration policies by maximizing the mutual information between task ids and inferred task embeddings. However, all the aforementioned exploration methods are oblivious to the structured heterogeneity of tasks, leading to inefficient exploration and ignorance of locally transferable knowledge within clusters.

¹ In this paper, we do not assume the knowledge in different clusters is mutually exclusive; each cluster can also contain overlapping global knowledge, e.g., motor skills in locomotion tasks.

