HIERARCHICAL META REINFORCEMENT LEARNING FOR MULTI-TASK ENVIRONMENTS

Anonymous

Abstract

Deep reinforcement learning algorithms aim to achieve human-level intelligence by solving practical decision-making problems, which are often composed of multiple sub-tasks. The complex and subtle relationships between sub-tasks make it hard for traditional methods to provide a promising solution. We implement a first-person shooting environment with random spatial structures as a typical representative of this class of problems. A desirable agent should be capable of balancing different sub-tasks: navigating to find enemies and shooting to kill them. To address the challenges posed by this environment, we propose a Meta Soft Hierarchical reinforcement learning framework (MeSH), in which each low-level sub-policy focuses on a specific sub-task and a high-level policy automatically learns to utilize the low-level sub-policies through meta-gradients. The proposed framework is able to disentangle multiple sub-tasks and discover proper low-level policies under different situations. The effectiveness and efficiency of the framework are shown by a series of comparison experiments. Both the environment and the algorithm code will be open-sourced to encourage further research.

1. INTRODUCTION

With the great breakthroughs of deep reinforcement learning (DRL) methods (Mnih et al., 2015; Silver et al., 2016; Mnih et al., 2016; Schulman et al., 2015; Lillicrap et al., 2015), there is an urgent need to apply DRL methods to more complex decision-making problems. Practical problems in the real world are often subtle combinations of multiple sub-tasks, which may occur simultaneously and are hard to disentangle along the time axis. For instance, in StarCraft games (Pang et al., 2019), agents need to consider both building units and organizing battles, and the active sub-tasks may change rapidly over the course of the game; sweeping robots trade off between navigating and collecting garbage; shooting agents must move to appropriate positions and launch attacks; and so on. The relationships between sub-tasks are complex and subtle. Sometimes sub-tasks compete with each other, and the agent needs to focus on one of them to gain a key advantage; at other times, they need to cooperate to preserve the possibility of global exploration. Learning simply by collecting experience and rewarding multiple objectives for different sub-tasks is often time-consuming and ineffective. A reasonable idea is to utilize deep hierarchical reinforcement learning (DHRL) methods (Vezhnevets et al., 2017; Igl et al., 2020), where the whole system is divided into a high-level agent and several low-level agents. The low-level agents learn sub-policies, which select atomic actions for their corresponding sub-tasks. The high-level agent is responsible for a meta task at an abstract level or a coarser time granularity, guiding the low-level agents by providing goals or by directly selecting among sub-policies. However, DHRL methods face some inherent problems: due to the complex interactions between multi-level agents, there is no theoretical guarantee of convergence, and they show unstable experimental performance. Most DHRL methods require heavy manual design, and the end-to-end system lacks reasonable semantic interpretation.
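The two-level structure described above can be illustrated with a minimal sketch: a high-level controller selects which low-level sub-policy acts at each step. The class names, the fixed score table, and the discrete observations below are illustrative stand-ins; in a real DHRL system both levels would be learned networks.

```python
class SubPolicy:
    """A low-level sub-policy mapping observations to atomic actions.

    Here each sub-policy simply prefers one fixed action; in practice
    these would be trained policies for individual sub-tasks.
    """
    def __init__(self, preferred_action):
        self.preferred_action = preferred_action

    def act(self, observation):
        return self.preferred_action


class HighLevelPolicy:
    """A high-level controller that picks which sub-policy to run.

    This toy version scores sub-policies with a fixed table keyed on a
    discrete observation; a learned controller would replace the table.
    """
    def __init__(self, sub_policies, scores):
        self.sub_policies = sub_policies
        self.scores = scores  # scores[observation] -> weight per sub-policy

    def select(self, observation):
        weights = self.scores[observation]
        best = max(range(len(weights)), key=lambda i: weights[i])
        return self.sub_policies[best]


# Two sub-tasks: navigation and shooting (hypothetical action names).
navigate = SubPolicy("move_forward")
shoot = SubPolicy("fire")

# The controller prefers navigation when no enemy is visible,
# and shooting when one is.
controller = HighLevelPolicy(
    [navigate, shoot],
    {"no_enemy": [0.8, 0.2], "enemy_visible": [0.1, 0.9]},
)

print(controller.select("no_enemy").act("no_enemy"))            # move_forward
print(controller.select("enemy_visible").act("enemy_visible"))  # fire
```

The hard `max` selection here is only for illustration; a soft, differentiable weighting over sub-policies (as the name MeSH suggests) would allow gradients from the task reward to flow into the high-level controller.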
In addition, these agents are often constrained to specific tasks and tend to overfit. Even when transferring between similar tasks, they perform poorly and require many additional adjustments. We introduce a first-person shooting (FPS) environment with random spatial structures. The game presents a 3D scene from a human perspective. When the player defeats all enemies, the player wins the game. When the player drops to the ground or loses all health points, the player loses the

