A SIMPLE APPROACH FOR VISUAL ROOM REARRANGEMENT: 3D MAPPING AND SEMANTIC SEARCH

Abstract

Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based solely on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approach consists of an off-the-shelf semantic segmentation model, a voxel-based semantic map, and a semantic search policy that efficiently finds objects needing rearrangement. Our method was the winning submission to the AI2-THOR Rearrangement Challenge at the CVPR 2022 Embodied AI Workshop, and improves on current state-of-the-art end-to-end reinforcement learning-based methods that learn visual room rearrangement policies, from 0.53% correct rearrangement to 16.56%, using only 2.7% as many samples from the environment.

1. INTRODUCTION

Physically rearranging objects is an everyday skill for humans, but remains a core challenge for embodied agents that assist humans in realistic environments. Natural environments for humans are complex and require generalization to a combinatorially large number of object configurations (Batra et al., 2020a). Generalization in complex realistic environments remains an immense practical challenge for embodied agents, and the rearrangement setting provides a rich test bed for embodied generalization in these environments. The rearrangement setting combines two challenging perception and control tasks: (1) understanding the state of a dynamic 3D environment, and (2) acting over a long horizon to reach a goal. These problems have traditionally been studied independently by the vision and reinforcement learning communities (Chaplot et al., 2021), but the advent of large models and challenging benchmarks is showing that both components are important for embodied agents. Reinforcement learning (RL) can excel at embodied tasks, especially when large amounts of experience are available for training (Weihs et al., 2021; Chaplot et al., 2020b; Ye et al., 2021). In a simulated environment with unlimited retries, this experience is cheap to obtain, and agents can explore randomly until a good solution is discovered. This pipeline works well for tasks like point-goal navigation (Wijmans et al., 2020), but in some cases this strategy is not enough. As the difficulty of embodied learning tasks increases, the agent must generalize to an increasing number of environment configurations, and broadly scaled experience can become insufficient. In the rearrangement setting, a perfect understanding of the environment simplifies the problem: an object is here, it should go there, and the rest can be solved with grasping and planning routines.
Representing the information about the locations and states of objects in an accessible format is therefore an important contribution for the rearrangement setting. Our initial experiments suggest that accurate 3D semantic maps of the environment are one such accessible format for visual rearrangement. With accurate 3D semantic maps, our method rearranges 15.11% of objects correctly, and requires significantly less experience from the environment to do so. While end-to-end RL requires up to 75 million environment steps in Weihs et al. (2021), our method uses only 2.7% as many samples and trains offline. Our results suggest that end-to-end RL without an accurate representation of the scene may be missing a fundamental aspect of understanding the environment.

We demonstrate how semantic maps help agents effectively understand dynamic 3D environments and perform visual rearrangement. These dynamic environments have elements that can move (like furniture), and objects with changing states (like the door of a cabinet). We present a method that builds accurate semantic maps in these dynamic environments, and reasons about what has changed. Deviating from prior work that leverages end-to-end RL, we propose a simple approach for visual rearrangement: (1) search for and map which objects need to be rearranged, and (2) procedurally rearrange objects until a desired goal configuration is reached.

We evaluate our approach on the AI2-THOR Rearrangement Challenge (Weihs et al., 2021) and establish a new state-of-the-art. We propose an architecture for visual rearrangement that builds voxel-based semantic maps of the environment and rapidly finds objects using a search-based policy. Our method improves by 14.72 absolute percentage points over current work in visual rearrangement, and is robust to the accuracy of the perception model, the budget for exploration, and the size of objects being rearranged.
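As a concrete illustration of the kind of voxel-based semantic map described above, the sketch below fuses a depth image and a semantic segmentation mask into a labeled voxel grid. This is not the authors' implementation; the function name, pinhole-intrinsics format, and grid parameters are assumptions for the example:

```python
import numpy as np

def update_semantic_map(voxel_map, depth, seg, intrinsics, pose,
                        voxel_size=0.05, grid_origin=(0.0, 0.0, 0.0)):
    """Back-project one depth frame into the voxel map, writing per-voxel
    semantic class labels taken from the segmentation image.

    voxel_map : (X, Y, Z) int array of class ids (0 = unobserved)
    depth     : (H, W) float array of depths in meters
    seg       : (H, W) int array of per-pixel class ids
    intrinsics: (fx, fy, cx, cy) pinhole camera parameters
    pose      : (4, 4) camera-to-world transform
    """
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0

    # Unproject valid pixels to 3D points in the camera frame.
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # (4, N)

    # Transform to the world frame, then discretize into voxel indices.
    pts_world = (pose @ pts_cam)[:3].T  # (N, 3)
    idx = np.floor((pts_world - np.asarray(grid_origin)) / voxel_size).astype(int)

    # Keep points that land inside the grid, then write their class labels.
    in_bounds = np.all((idx >= 0) & (idx < np.array(voxel_map.shape)), axis=1)
    idx, labels = idx[in_bounds], seg[valid][in_bounds]
    voxel_map[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    return voxel_map
```

Calling this once per observation incrementally accumulates a persistent 3D scene representation that downstream rearrangement logic can query.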
We conduct ablations to diagnose where the bottlenecks are for visual rearrangement, and find that accurate scene understanding is the most crucial. As an upper bound, when provided with a perfect semantic map, our method solves 38.33% of tasks, suggesting significant out-of-the-box gains as better perception models are developed. Our results show the importance of building effective scene representations for embodied agents in complex and dynamic visual environments.
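The paper's "reason about what has changed" step can be illustrated by comparing two semantic maps, one built at the goal configuration and one at the current (shuffled) configuration, and flagging classes whose occupancy has moved. The sketch below is a simplified assumption, not the authors' method: it uses per-class voxel centroids, whereas a real system would also handle multiple instances and open/closed states:

```python
import numpy as np

def objects_to_rearrange(current_map, goal_map, voxel_size=0.05, tol=0.25):
    """Compare two voxel semantic maps and return, per class id, the
    centroid displacement (meters) between current and goal occupancy.

    A class is flagged for rearrangement when its occupied-voxel centroid
    moves by more than `tol` meters between the two maps.
    """
    moved = {}
    classes = np.union1d(np.unique(current_map), np.unique(goal_map))
    for c in classes[classes > 0]:  # 0 = unobserved / free space
        cur = np.argwhere(current_map == c)
        goal = np.argwhere(goal_map == c)
        if len(cur) == 0 or len(goal) == 0:
            continue  # unobserved in one map; a search policy must find it
        shift = np.linalg.norm(cur.mean(0) - goal.mean(0)) * voxel_size
        if shift > tol:
            moved[int(c)] = shift
    return moved
```

Objects that appear in only one of the two maps fall through to the search phase, which is exactly where an efficient semantic search policy pays off.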

2. RELATED WORK

Embodied 3D Scene Understanding. Knowledge of the 3D environment is at the heart of various tasks for embodied agents, such as point navigation (Anderson et al., 2018a), image navigation (Batra et al., 2020b; Yang et al., 2019), vision-language navigation (Anderson et al., 2018b; Shridhar et al., 2020), embodied question answering (Gordon et al., 2018; Das et al., 2018), and more. These tasks require an agent to reason about its 3D environment. For example, vision-language navigation (Anderson et al., 2018b; Shridhar et al., 2020) requires grounding language in an environment goal, and reasoning about where to navigate and what to modify in the environment to reach that goal. Reasoning about the 3D environment is especially important for the rearrangement setting, and has a rich interdisciplinary history in the robotics, vision, and reinforcement learning communities.

Visual Room Rearrangement. Rearrangement has long been one of the fundamental tasks in robotics research (Ben-Shahar & Rivlin, 1996; Stilman et al., 2007; King et al., 2016; Krontiris & Bekris, 2016; Yuan et al., 2018; Correll et al., 2018; Labbé et al., 2020). Typically, these methods address the challenge in a setting where the state of the objects is fully observed (Cosgun et al., 2011; King et al., 2016), which allows for efficient and accurate planning-based solutions. In contrast, there has been recent interest in visual rearrangement (Batra et al., 2020a; Weihs et al., 2021; Qureshi et al., 2021; Goyal et al., 2022; Gadre et al., 2022) where the states of objects and the rearrangement



Figure 1: Our method incrementally builds voxel-based Semantic Maps from visual observations and efficiently finds objects using a Semantic Search Policy. We visualize an example rearrangement on the right with the initial position of the pink object (laptop on the bed), followed by the agent holding the object (laptop), and finally the destination position of the object (laptop on the desk).

