UNSUPERVISED TASK CLUSTERING FOR MULTI-TASK REINFORCEMENT LEARNING

Anonymous

Abstract

Meta-learning, transfer learning and multi-task learning have recently laid a path towards more generally applicable reinforcement learning agents that are not limited to a single task. However, most existing approaches implicitly assume a uniform similarity between tasks. We argue that this assumption is limiting in settings where the relationship between tasks is unknown a priori. In this work, we propose a general approach to automatically cluster together similar tasks during training. Our method, inspired by the expectation-maximization algorithm, succeeds at finding clusters of related tasks and uses these to improve sample complexity. We achieve this by designing an agent with multiple policies. In the expectation step, we evaluate the performance of the policies on all tasks and assign each task to the best-performing policy. In the maximization step, each policy trains by sampling tasks from its assigned set. This method is intuitive, simple to implement and orthogonal to other multi-task learning algorithms. We show the generality of our approach by evaluating on simple discrete and continuous control tasks, as well as complex bipedal walker tasks and Atari games. Results show improvements in sample complexity as well as broader applicability compared to other approaches.

1. INTRODUCTION

Figure 1: An agent (smiley) should reach one of 12 goals (stars) in a grid world. Learning to reach a goal in the top right corner helps it to learn about the other goals in that corner. However, learning to reach the green stars (bottom left corner) at the same time gives conflicting objectives, hindering training. Task clustering resolves the issue.

Imagine we are given an arbitrary set of tasks. We know that dissimilarities and/or contradicting objectives can exist. However, in most settings we can only guess these relationships and how they might affect joint training. Many recent works rely on such human guesses and (implicitly or explicitly) limit the generality of their approaches. This can lead to impressive results, either by explicitly modeling the relationships between tasks as in transfer learning (Zhu et al., 2020), or by meta-learning implicit relations (Hospedales et al., 2020). However, in some cases an incorrect similarity assumption can hurt learning performance (Lazaric, 2012). Our aim with this paper is to provide an easy, straightforward approach that avoids human assumptions about task similarities. An obvious solution is to train a separate policy for each task. However, this requires a large amount of experience to learn the desired behaviors. It is therefore desirable to have a single agent and allow the sharing of knowledge between tasks. This is generally known as multi-task learning, a field which has received a large amount of interest in both the supervised learning and reinforcement learning (RL) communities (Zhang & Yang, 2017). If tasks are sufficiently similar, a policy that is trained on one task provides a good starting point for another task, and experience from each task will help training in the other tasks. This is known as positive transfer (Lazaric, 2012). However, if the tasks are sufficiently dissimilar, negative transfer occurs and reusing a pre-trained policy is disadvantageous.
It can even lead to worse performance than simply starting from a random initialization. Here, using experience from the other tasks might slow training or even prevent convergence to a good policy. Most previous approaches to multi-task learning do not directly account for problems caused by negative transfer and either accept its occurrence or limit their experiments to sufficiently similar tasks. We present a hybrid approach that is helpful in settings where the task set contains clusters of related tasks, amongst which transfer is helpful. To illustrate the intuition, we provide a conceptualized example in Figure 1. The figure shows a grid world with 12 tasks that fall naturally into 4 clusters. Note, however, that our approach goes beyond this conceptual ideal and can be beneficial even if the clustering is not perceivable by humans a priori. Our approach is inspired by the expectation-maximization framework and uses a set of completely separate policies within our agent. We iteratively evaluate the set of policies on all tasks, assign tasks to policies based on their respective performance, and train the policies on their assigned tasks. This leads to policies naturally specializing to clusters of related tasks, yielding an interpretable decomposition of the full task set. Moreover, we show that our approach can improve the learning speed and final reward in multi-task RL settings. To summarize our contributions:

• We propose a general approach inspired by Expectation-Maximization (EM) that can find clusters of related tasks in an unsupervised manner during training.

• We provide an evaluation on a diverse set of multi-task RL problems that shows the improved sample complexity and reduction in negative transfer of our approach.

• We show the importance of meaningful clustering and the sensitivity to the assumed number of clusters in an ablation study.
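The iterative evaluate-assign-train loop described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the scalar ToyPolicy, its deterministic mean update, and the greedy argmax assignment are illustrative stand-ins for real RL policies, policy-gradient updates, and performance-based assignment.

```python
import random


class ToyPolicy:
    """Stand-in for an RL policy: a single scalar parameter.

    evaluate() plays the role of the E-step rollout (reward is higher the
    closer the parameter is to the task's target value), and train() is a
    simplified, deterministic stand-in for the M-step (a real agent would
    instead run an RL algorithm on tasks sampled from the assigned set).
    """

    def __init__(self, rng):
        self.param = rng.uniform(0.0, 10.0)

    def evaluate(self, task):
        # Reward: negative distance between the parameter and the task target.
        return -abs(self.param - task)

    def train(self, assigned_tasks):
        # Toy update: move to the mean of the assigned task targets.
        self.param = sum(assigned_tasks) / len(assigned_tasks)


def em_task_clustering(tasks, n_policies, n_rounds, seed=0):
    rng = random.Random(seed)
    policies = [ToyPolicy(rng) for _ in range(n_policies)]
    assignment = {}
    for _ in range(n_rounds):
        # E-step: evaluate every policy on every task and assign each
        # task to its best-performing policy.
        assignment = {
            t: max(range(n_policies), key=lambda i: policies[i].evaluate(t))
            for t in tasks
        }
        # M-step: each policy trains only on the tasks assigned to it.
        for i, policy in enumerate(policies):
            assigned = [t for t, j in assignment.items() if j == i]
            if assigned:
                policy.train(assigned)
    return policies, assignment
```

On four tasks forming two natural clusters, e.g. `em_task_clustering([1.0, 1.2, 8.0, 8.3], n_policies=2, n_rounds=5)`, the loop separates them: the two low-target tasks end up assigned to one policy and the two high-target tasks to the other, with each policy specializing to its cluster.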

2. RELATED WORK

Expectation-Maximization (EM) has previously been used in RL to directly learn a policy. By reformulating RL as an inference problem with a latent variable, it is possible to use EM to find the maximum likelihood solution, corresponding to the optimal policy. We direct the reader to Deisenroth et al. (2013) for a survey on the topic. Our approach is different: we use an EM-inspired procedure to cluster tasks in a multi-task setting and rely on recent RL algorithms to learn the tasks. In supervised learning, the idea of subdividing tasks into related clusters was proposed by Thrun & O'Sullivan (1996), who use a distance metric based on generalization accuracy to cluster tasks. Another popular idea related to our approach that emerged from supervised learning is the mixture of experts (Jacobs et al., 1991), in which multiple sub-networks are trained together with an input-dependent gating network. Jordan & Jacobs (1993) also proposed an EM algorithm to learn the mixture of experts. While those approaches have been extended to the control setting (Jacobs & Jordan, 1990; 1993; Meila & Jordan, 1995; Cacciatore & Nowlan, 1993; Tang & Hauser, 2019), they rely on an explicit supervision signal, and it is not clear how such an approach would work in an RL setting. A variety of other methods have been proposed in the supervised learning literature; for brevity, we direct the reader to the survey by Zhang & Yang (2017), which provides a good overview of the topic. Our work differs in that we focus on RL, where no labeled data set exists. In RL, task clustering has previously received attention in works on transfer learning. Carroll & Seppi (2005) proposed to cluster tasks based on a distance function, with distances based on Q-values, reward functions, optimal policies, or transfer performance, and to use the resulting clustering to guide transfer. Similarly, Mahmud et al.
(2013) propose a method for clustering Markov Decision Processes (MDPs) for source task selection. They design a cost function for their chosen transfer method and derive an algorithm to find a clustering that minimizes this cost function. Our approach differs from both in that we do not assume knowledge of the underlying MDPs and corresponding optimal policies. Furthermore, the general nature of our approach allows it to scale to complex tasks, where comparing properties of the full underlying MDPs is not feasible. An earlier approach by Wilson et al. (2007) developed a hierarchical Bayesian approach to multi-task RL. Their approach uses a Dirichlet process to cluster the distributions from which they sample full MDPs, in the hope that a sampled MDP aligns with the task at hand. They then solve the sampled MDP and use the resulting policy to gather data from the environment and refine the posterior distributions for the next iteration. While their method is therefore limited to simple MDPs, our approach can be combined with function approximation and thus has the potential to scale to MDPs with large or infinite state spaces which cannot be solved in closed form. Lazaric & Ghavamzadeh (2010) use a hierarchical Bayesian approach to infer the parameters of a linear value function and utilize

