MULTI-SKILL MOBILE MANIPULATION FOR OBJECT REARRANGEMENT

Abstract

We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location where a stationary manipulation skill can not reach its target to manipulate. To this end, we propose that the manipulation skills should include mobility to have flexibility in interacting with the target object from multiple locations and at the same time the navigation skill could have multiple end points which lead to successful manipulation. We operationalize these ideas by implementing mobile manipulation skills rather than stationary ones and training a navigation skill trained with region goal instead of point goal. We evaluate our multi-skill mobile manipulation method M3 on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), and show superior performance as compared to the baselines.

1. INTRODUCTION

Building AI with embodiment is an important future mission of AI. Object rearrangement (Batra et al., 2020) is considered as a canonical task for embodied AI. The most challenging rearrangement tasks (Szot et al., 2021; Ehsani et al., 2021; Gan et al., 2021) are often long-horizon mobile manipulation tasks, which demand both navigation and manipulation abilities, e.g., to move to certain locations and to pick or place objects. It is challenging to learn a monolithic RL policy for complex long-horizon mobile manipulation tasks, due to challenges such as high sample complexity, complicated reward design, and inefficient exploration. A practical solution to tackle a long-horizon task is to decompose it into a set of subtasks, which are tractable, short-horizon, and compact in state or action spaces. Each subtask can be solved by designing or learning a skill, so that a sequence of skills can be chained to complete the entire task (Lee et al., 2018; Clegg et al., 2018; Lee et al., 2019; 2021) . For example, skills for object rearrangement can be picking or placing objects, opening or closing fridges and drawers, moving chairs, navigating in the room, etc. Achieving successful object rearrangement using this modular framework requires careful subtask formulation such that skills trained for these subtasks can be chained together effectively. We define three desirable properties for skills to solve diverse long-horizon tasks: achievability, composability, and reusability. Note that we assume each subtask is associated with a set of initial states. Then, achievability quantifies the portion of initial states solvable by a skill. A pair of skills are composable if the initial states achievable by the succeeding skill can encompass the terminal states of the preceding skill. This encompassment requirement is necessary to ensure robustness to mild compounding errors. However, trivially enlarging the initial set of a subtask increases learning difficulty and may lead to many unachievable initial states for the designed/learned skill. Last, a skill is reusable if it can be directly chained without or with limited fine-tuning (Clegg et al., 2018; Lee et al., 2021) . According to our experiments, effective subtask formulation is critical though largely overlooked in the literature.



Project website: https://sites.google.com/view/hab-m3 Codes: https://github.com/Jiayuan-Gu/hab-mobile-manipulation 1

