SQA3D: SITUATED QUESTION ANSWERING IN 3D SCENES



xiaojian.ma@ucla.edu, yongzl19@mails.tsinghua.edu.cn {zlzheng,liqing,sczhu,syhuang}@bigai.ai, yitaol@pku.edu.cn

[Figure 1 near here. In-figure example — situation description s_txt: "Sitting at the edge of the bed and facing the couch." Question q: "Can I go straight to the coffee table in front of me?" Scene context S: 3D scan, egocentric video, bird-eye view (BEV) picture, etc.]

ABSTRACT

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal models, especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capabilities. Code and data are released at sqa3d.github.io.

1. INTRODUCTION

In recent years, the endeavor of building intelligent embodied agents has delivered fruitful achievements. Despite these promising advances, their actual performance in real-world embodied environments can still fall short of human expectations, especially when generalizing to different situations (scenes and locations) and to tasks that require substantial, knowledge-intensive reasoning. To diagnose the fundamental capabilities of realistic embodied agents, we investigate the problem of embodied scene understanding, where the agent needs to understand its situation and surroundings in the environment from a dynamic egocentric view, then perceive, reason, and act accordingly to accomplish complex tasks.

What is at the core of embodied scene understanding? Drawing inspiration from situated cognition (Greeno, 1998; Anderson et al., 2000), a seminal theory of embodiment, we anticipate it to be two-fold:

• Situation understanding. The ability to imagine what the agent will see from an arbitrary situation (position, orientation, etc.) in a 3D scene and to understand the surroundings anchored to that situation, and therefore to generalize to novel positions or scenes;

• Situated reasoning. The ability to acquire knowledge about the environment based on the agent's current situation and to reason with that knowledge, thereby further facilitating complex action-planning tasks.

To step towards embodied scene understanding, we introduce SQA3D, a new task that reconciles the best of both worlds, situation understanding and situated reasoning, into embodied 3D scene understanding. Figure 1 sketches our task: given a 3D scene context (e.g., a 3D scan, egocentric video, or bird-eye view (BEV) picture), the agent in the 3D scene needs to first comprehend and localize its situation (position, orientation, etc.) from a textual description, then answer a question that requires substantial situated reasoning from that perspective.
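The task interface just described — scene context S, situation description s_txt, question q, and an answer — can be sketched as a minimal data record. The field names below are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class SQA3DSample:
    """One situated-QA example (illustrative schema, not the official one)."""
    scene_context: Any            # scene context S: 3D scan, egocentric video, or BEV image
    situation: str                # textual situation description s_txt
    question: str                 # question q, answered under that situation
    answer: Optional[str] = None  # ground-truth answer, if available

# The example from Figure 1 (scene context omitted here):
sample = SQA3DSample(
    scene_context=None,
    situation="Sitting at the edge of the bed and facing the couch.",
    question="Can I go straight to the coffee table in front of me?",
)
```

An agent for this task would then be any function mapping (scene_context, situation, question) to an answer string.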
We crowd-sourced the situation descriptions from Amazon MTurk (AMT), where participants are instructed to select diverse locations and orientations in 3D scenes. To systematically examine the agent's ability in situated reasoning, we collect questions that cover a wide spectrum of knowledge, ranging from spatial relations to navigation, common-sense reasoning, and multi-hop reasoning. In total, SQA3D comprises 20.4k descriptions of 6.8k unique situations collected from 650 ScanNet scenes and 33.4k questions about these situations. Examples of SQA3D can be found in Figure 2.

Our task closely connects to recent efforts on 3D language grounding (Dai et al., 2017; Chen et al., 2020; 2021; Hong et al., 2021b; Achlioptas et al., 2020; Wang et al., 2022; Azuma et al., 2022). However, most of these efforts assume observations of a 3D scene are made from a third-person perspective rather than an embodied, egocentric view, and they primarily inspect spatial understanding, while SQA3D examines scene understanding with a wide range of knowledge, and its problems have to be solved using an (imagined) first-person view. Embodied QA (Das et al., 2018; Wijmans et al., 2019a) draws very similar motivation as SQA3D, but our task adopts a simplified protocol (QA only) while still preserving the function of benchmarking embodied scene understanding, therefore allowing more complex, knowledge-intensive questions and a much larger scale of data collection. Comparisons with relevant tasks and benchmarks are listed in Table 1.

Benchmarking existing baselines: In our experiments, we examine state-of-the-art multi-modal reasoning models, including ScanQA from Azuma et al. (2022) that leverages 3D scan data, Clip-
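As a point of reference for the overall scores reported for the benchmarked models, a simple exact-match answer accuracy can be computed as follows. This is a sketch only; the benchmark's official evaluation script may normalize answers differently:

```python
def answer_accuracy(predictions, ground_truths):
    """Percentage of questions answered exactly right.

    Matching is case- and surrounding-whitespace-insensitive; this is an
    illustrative metric, not the official SQA3D evaluation protocol.
    """
    assert len(predictions) == len(ground_truths), "lists must align"
    norm = lambda s: s.strip().lower()
    correct = sum(norm(p) == norm(g) for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(predictions)
```

For example, `answer_accuracy(["Yes", "no"], ["yes", "left"])` scores one of two answers correct, i.e., 50.0.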



Figure 1: Task illustration of Situated Question Answering in 3D Scenes (SQA3D). Given scene context S (e.g., 3D scan, egocentric video, bird-eye view picture), SQA3D requires an agent to first comprehend and localize its situation (position, orientation, etc.) in the 3D scene from a textual description s_txt, then answer a question q under that situation. Note that understanding the situation and correctly imagining the corresponding egocentric view is necessary to accomplish our task. We provide more example questions in Figure 2.

Figure 2: Examples from SQA3D. We provide some example questions with the corresponding situations (s_txt) and 3D scenes. The categories listed here are not meant to be exhaustive, and a question could fall into multiple categories. Green boxes indicate objects relevant to the situation description s_txt, while red boxes indicate objects relevant to the questions q. In-figure category titles: embodied activities, navigation, common sense, multi-hop reasoning.

