HIERARCHICAL META REINFORCEMENT LEARNING FOR MULTI-TASK ENVIRONMENTS

Anonymous

Abstract

Deep reinforcement learning algorithms aim to achieve human-level intelligence by solving practical decision-making problems, which are often composed of multiple sub-tasks. The complex and subtle relationships between sub-tasks make it hard for traditional methods to produce promising solutions. We implement a first-person shooting environment with random spatial structures as a typical representative of this kind of problem. A desirable agent should be capable of balancing different sub-tasks: navigating to find enemies and shooting to kill them. To address the problem posed by this environment, we propose a Meta Soft Hierarchical reinforcement learning framework (MeSH), in which each low-level sub-policy focuses on a specific sub-task and the high-level policy automatically learns to utilize the low-level sub-policies through meta-gradients. The proposed framework is able to disentangle multiple sub-tasks and discover appropriate low-level policies under different situations. The effectiveness and efficiency of the framework are demonstrated by a series of comparison experiments. Both the environment and the algorithm code will be open-sourced to encourage further research.

1. INTRODUCTION

With the great breakthroughs of deep reinforcement learning (DRL) methods (Mnih et al., 2015; Silver et al., 2016; Mnih et al., 2016; Schulman et al., 2015; Lillicrap et al., 2015), there is an urgent need to apply DRL to more complex decision-making problems. Practical problems in the real world are often subtle combinations of multiple sub-tasks, which may occur simultaneously and are hard to disentangle along the time axis. For instance, in StarCraft games (Pang et al., 2019), agents need to consider building units and organizing battles, and the active sub-tasks may change rapidly over the course of a game; sweeping robots trade off between navigating and collecting garbage; shooting agents should move to appropriate positions and launch attacks; etc. The relationships between sub-tasks are complex and subtle. Sometimes the sub-tasks compete with each other, and the agent needs to focus on one of them to gain a key advantage; at other times, they need to cooperate with each other to maintain the possibility of global exploration. Learning simply by collecting experience and rewarding multiple objectives for different sub-tasks is often time-consuming and ineffective. A reasonable idea is to utilize deep hierarchical reinforcement learning (DHRL) methods (Vezhnevets et al., 2017; Igl et al., 2020), in which the whole system is divided into a high-level agent and several low-level agents. The low-level agents learn sub-policies, which select atomic actions for the corresponding sub-tasks. The high-level agent is responsible for a meta task at a more abstract level or a coarser time granularity, guiding the low-level agents by issuing goals or by directly selecting among sub-policies. However, DHRL methods face some inherent problems: due to the complex interaction between multi-level agents, there is no theoretical guarantee of convergence, and they show unstable experimental performance. Most DHRL methods require heavy manual design, and the resulting end-to-end systems lack reasonable semantic interpretation.
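The high-level/low-level decomposition described above can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, the linear scorers standing in for learned networks, and all names are hypothetical, and it shows only the "hard" variant in which the high-level agent commits to a single sub-policy per step.

```python
import numpy as np

# Hypothetical sketch of a hard two-level hierarchy: a high-level policy
# picks exactly one sub-policy, which then emits the atomic action.
rng = np.random.default_rng(0)
N_SUBPOLICIES, STATE_DIM, N_ACTIONS = 3, 4, 5

# Random linear scorers stand in for learned networks.
high_level_w = rng.normal(size=(STATE_DIM, N_SUBPOLICIES))
sub_policy_w = rng.normal(size=(N_SUBPOLICIES, STATE_DIM, N_ACTIONS))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def act(state):
    # The high level commits to a single sub-policy (argmax = hard selection)...
    k = int(np.argmax(state @ high_level_w))
    # ...and only that sub-policy's action distribution is consulted.
    probs = softmax(state @ sub_policy_w[k])
    return k, int(np.argmax(probs))

state = rng.normal(size=STATE_DIM)
k, a = act(state)
```

The discontinuous argmax selection is one source of the instability noted above: the high-level choice is not differentiable, so credit assignment across levels is hard.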
In addition, these agents are often constrained to specific tasks and overfit easily. Even when transferring between similar tasks, they perform poorly and need substantial additional adjustment. We introduce a first-person shooting (FPS) environment with random spatial structures. The game contains a 3D scene viewed from a human perspective. The player wins by defeating all enemies, and loses by dropping to the ground or losing all health points. Since dropping to the ground is very risky for the player, the environment contains two key tasks: navigation and combat. The terrain and enemies in the game are randomly generated. This ensures that: 1) the agent cannot learn useful information by memorizing coordinates; 2) the possibility of overfitting is restrained and the generalization ability of the learned policy is enhanced. The state information is expressed via raycasts. This representation of environment information requires much less computing than a raw image representation: the agent can be trained and tested even on CPU-only machines, which lets us pay more attention to the reinforcement learning algorithm itself rather than the computing power required for image processing. For this environment, we propose a Meta Soft Hierarchical reinforcement learning framework (MeSH). The high-level policy is a differentiable meta parameter generator, and the low-level policy contains several sub-policies, which share the same form and are differentiated automatically during training. The high-level policy selects and combines sub-policies through the meta parameters and interacts with the environment. We find that the meta generator adaptively combines sub-policies as the task progresses and has strong semantic interpretability. Compared with a series of baselines, the agent achieves excellent performance in the FPS environment.
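The soft combination just described can be sketched as a differentiable mixture. This is a hedged illustration, not MeSH's actual code: all dimensions and the linear stand-ins for the meta generator and sub-policy networks are assumptions; only the structure (meta weights softly mixing sub-policy action distributions) follows the text.

```python
import numpy as np

# Hypothetical sketch of a soft hierarchy: a meta generator produces
# differentiable weights over K sub-policies, and the executed action
# distribution is their weighted mixture.
rng = np.random.default_rng(1)
K, STATE_DIM, N_ACTIONS = 4, 6, 5

meta_w = rng.normal(size=(STATE_DIM, K))            # meta parameter generator
sub_w = rng.normal(size=(K, STATE_DIM, N_ACTIONS))  # K sub-policies, same form

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def mixed_policy(state):
    weights = softmax(state @ meta_w)                          # (K,) mixture weights
    sub_probs = softmax(np.einsum('d,kda->ka', state, sub_w))  # (K, N_ACTIONS)
    return weights @ sub_probs                                 # (N_ACTIONS,) mixture

probs = mixed_policy(rng.normal(size=STATE_DIM))
```

Because the mixture weights come from a softmax rather than an argmax, the whole pipeline stays differentiable, which is what allows the high-level generator to be trained with meta-gradients rather than a separate discrete controller.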
The main contributions of this work are as follows:
• clarifying the complex relationships within multi-task compositions;
• a novel meta soft hierarchical reinforcement learning framework, MeSH, which uses a differentiable meta generator to adaptively select sub-policies and shows strong interpretability;
• a series of comparison experiments showing the effectiveness of the framework;
• an open-sourced environment and code to encourage further research on multi-task RL¹.
The rest of this paper is organized as follows. We discuss related work in Section 2, introduce the details of the implemented environment in Section 3, present our proposed framework in Section 4, describe our experiments in Section 5, and conclude in Section 6.

2. RELATED WORK

In decision-making problems with high-dimensional continuous state spaces, the agent often needs to complete tasks that contain multiple sub-tasks. In the taxi agent problem (Dietterich, 2000), the agent needs to complete sub-tasks such as pickup, navigation, and putdown. Menashe & Stone (2018) proposed the Escape Room Domain, a testbed for HRL in which the agent starts from a fixed point and must press four buttons of different colors to leave the room. In these environments, the agent needs to optimize several sub-tasks while minimizing the mutual negative influence between them. However, the sub-tasks in these environments are temporally dependent; the methods above are helpless in a multi-task environment that requires fulfilling multiple tasks simultaneously.

Architectural solutions use hierarchical structures to decompose tasks into action primitives. Sutton et al. (1999) models temporal abstraction as options on top of extended actions, and Bacon et al. (2017) proposes an actor-critic option method based on it. Henderson et al. (2017) extend the options framework to learn joint reward-policy options. Besides, Jiang et al. (2019) construct a compositional structure with language as abstraction or instruction. Due to the specific structural design of these methods, the high-level agent is unable to execute multiple sub-policies simultaneously in any form.

Recent HRL works learn intra-goals to instruct sub-policies. Vezhnevets et al. (2017) proposes a manager-worker model in which the manager abstracts goals and instructs the worker; this architecture uses directional goals rather than absolute update goals. Oh et al. (2017) learns a meta controller to instruct the implementation and update of sub-tasks. Igl et al. (2020) presents a new option-based soft hierarchy method that learns with a shared prior and hierarchical posterior policies. Yu et al. (2020) proposes a method that projects conflicting gradients onto the normal plane to prevent some task gradients from dominating others. Compared with the hard hierarchy methods, these methods use the state's natural features to update the upper-level policy, avoiding the timing constraints of handcrafted sub-tasks. However, due to the lack of meaningful learning goals for the sub-policies, the low-level policies fail to focus on explainable sub-tasks.

¹ https://github.com/MeSH-ICLR/MEtaSoftHierarchy.git
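The gradient-projection idea attributed to Yu et al. (2020) can be shown with a small worked example. The vectors below are made up for illustration; only the rule itself (when two task gradients have a negative inner product, remove from one its component along the other) comes from the description above.

```python
import numpy as np

def project_conflicting(g_i, g_j):
    """If g_i conflicts with g_j (negative inner product), project g_i
    onto the normal plane of g_j so g_j no longer dominates the update."""
    dot = g_i @ g_j
    if dot < 0:  # gradients conflict
        g_i = g_i - (dot / (g_j @ g_j)) * g_j
    return g_i

g_task1 = np.array([1.0, 1.0])
g_task2 = np.array([-1.0, 0.5])  # conflicts with g_task1: dot = -0.5

adjusted = project_conflicting(g_task1, g_task2)
# The adjusted gradient is orthogonal to g_task2, so following it
# no longer moves directly against the other task's objective.
```

With these numbers the adjusted gradient is [0.6, 1.2], whose inner product with g_task2 is exactly zero, while non-conflicting gradients pass through unchanged.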

