A SIMPLE APPROACH FOR VISUAL ROOM REARRANGEMENT: 3D MAPPING AND SEMANTIC SEARCH

Abstract

Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based solely on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approach consists of an off-the-shelf semantic segmentation model, a voxel-based semantic map, and a semantic search policy to efficiently find objects that need to be rearranged. Our method was the winning submission to the AI2-THOR Rearrangement Challenge in the Embodied AI Workshop at CVPR 2022, and improves on current state-of-the-art end-to-end reinforcement learning-based methods that learn visual room rearrangement policies, raising the fraction of correct rearrangements from 0.53% to 16.56% while using only 2.7% as many samples from the environment.

1. INTRODUCTION

Physically rearranging objects is an everyday skill for humans, but remains a core challenge for embodied agents that assist humans in realistic environments. Natural environments for humans are complex and require generalization to a combinatorially large number of object configurations (Batra et al., 2020a). Generalization in complex realistic environments remains an immense practical challenge for embodied agents, and the rearrangement setting provides a rich test bed for embodied generalization in these environments. The rearrangement setting combines two challenging perception and control tasks: (1) understanding the state of a dynamic 3D environment, and (2) acting over a long horizon to reach a goal. These problems have traditionally been studied independently by the vision and reinforcement learning communities (Chaplot et al., 2021), but the advent of large models and challenging benchmarks is showing that both components are important for embodied agents. Reinforcement learning (RL) can excel at embodied tasks, especially when a large amount of experience can be leveraged for training (Weihs et al., 2021; Chaplot et al., 2020b; Ye et al., 2021). In a simulated environment with unlimited retries, this experience is cheap to obtain, and agents can explore randomly until a good solution is discovered. This pipeline works well for tasks like point-goal navigation (Wijmans et al., 2020), but in some cases this strategy is not enough. As the difficulty of embodied learning tasks increases, the agent must generalize to an increasing number of environment configurations, and broadly scaled experience can become insufficient. In the rearrangement setting, a perfect understanding of the environment simplifies the problem: an object is here, it should go there, and the rest can be solved with grasping and planning routines.
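To make the final observation concrete, the following sketch (all names are hypothetical, not from the paper) shows how, given a perfect state estimate, rearrangement reduces to diffing current object poses against goal poses and delegating each mismatch to grasping and planning routines:

```python
def objects_to_rearrange(current, goal, tol=0.1):
    """current, goal: dicts mapping object name -> (x, y, z) position.

    Returns the names of objects whose current position deviates from
    the goal position by more than tol (meters).
    """
    moves = []
    for name, goal_pos in goal.items():
        cur_pos = current.get(name)
        if cur_pos is None:
            continue  # unobserved objects must first be found by search
        dist = sum((a - b) ** 2 for a, b in zip(cur_pos, goal_pos)) ** 0.5
        if dist > tol:
            moves.append(name)  # hand off to grasping/planning routines
    return moves
```

With a perfect map, each returned name becomes one pick-and-place subtask; the hard part in practice is producing `current` from visual input alone.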
Representing information about the locations and states of objects in an accessible format is therefore an important contribution for the rearrangement setting. Our initial experiments suggest that accurate 3D semantic maps of the environment are one such accessible format for visual rearrangement. With accurate 3D semantic maps, our method rearranges 15.11% of objects correctly, and requires significantly less experience from the environment to do so. While end-to-end RL requires up to 75 million environment steps in Weihs et al. (2021), our method only requires 2 million environment steps.
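A minimal sketch of such a voxel-based semantic map (illustrative only; the class and method names are assumptions, not the paper's implementation): labeled 3D points, e.g. back-projected from an off-the-shelf semantic segmentation model, are discretized into voxels, and each voxel accumulates per-class observation counts so the map can be queried at any location.

```python
from collections import defaultdict


class VoxelSemanticMap:
    """Toy voxel-based semantic map keyed by discretized 3D coordinates."""

    def __init__(self, voxel_size=0.25):
        self.voxel_size = voxel_size
        # (i, j, k) voxel index -> {class label: observation count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def _key(self, point):
        # Discretize a continuous (x, y, z) point into a voxel index.
        return tuple(int(c // self.voxel_size) for c in point)

    def update(self, point, label):
        """Register one labeled 3D point into its voxel."""
        self.counts[self._key(point)][label] += 1

    def query(self, point):
        """Return the most frequently observed class at this location,
        or None if the voxel is unobserved."""
        votes = self.counts.get(self._key(point))
        if not votes:
            return None
        return max(votes, key=votes.get)


# Usage: noisy per-frame labels for the same region vote within a voxel.
m = VoxelSemanticMap(voxel_size=0.25)
m.update((1.10, 0.30, 2.00), "mug")
m.update((1.20, 0.35, 2.10), "mug")
m.update((1.15, 0.30, 2.05), "bowl")
```

Accumulating counts per voxel gives a simple form of temporal label smoothing: a single mis-segmented frame is outvoted by repeated consistent observations of the same region.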

