WATCH-AND-HELP: A CHALLENGE FOR SOCIAL PERCEPTION AND HUMAN-AI COLLABORATION

Abstract

In this paper, we introduce Watch-And-Help (WAH), a challenge for testing social intelligence in agents. In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently. To succeed, the AI agent needs to i) understand the underlying goal of the task by watching a single demonstration of the human-like agent performing the same task (social perception), and ii) coordinate with the human-like agent to solve the task in an unseen environment as fast as possible (human-AI collaboration). For this challenge, we build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning- and learning-based baselines. We evaluate the performance of AI agents with the human-like agent as well as with real humans, using objective metrics and subjective user ratings. Experimental results demonstrate that the proposed challenge and virtual environment enable a systematic evaluation of the important aspects of machine social intelligence at scale.

1. INTRODUCTION

Humans exhibit altruistic behaviors at an early age (Warneken & Tomasello, 2006). Without much prior experience, children can robustly recognize goals of other people by simply watching them act in an environment, and are able to come up with plans to help them, even in novel scenarios. In contrast, the most advanced AI systems to date still struggle with such basic social skills. In order to achieve the level of social intelligence required to effectively help humans, an AI agent should acquire two key abilities: i) social perception, i.e., the ability to understand human behavior, and ii) collaborative planning, i.e., the ability to reason about the physical environment and plan its actions to coordinate with humans. In this paper, we are interested in developing AI agents with these two abilities.

Towards this goal, we introduce a new AI challenge, Watch-And-Help (WAH), which focuses on social perception and human-AI collaboration. In this challenge, an AI agent needs to collaborate with a human-like agent to enable it to achieve the goal faster. In particular, we present a 2-stage framework as shown in Figure 1. In the first, Watch stage, an AI agent (Bob) watches a human-like agent (Alice) performing a task once and infers Alice's goal from her actions. In the second, Help stage, Bob helps Alice achieve the same goal in a different environment as quickly as possible (i.e., with the minimum number of environment steps). This 2-stage framework poses unique challenges for human-AI collaboration. Unlike prior work which provides a common goal a priori or considers a small goal space (Goodrich & Schultz, 2007; Carroll et al., 2019), our AI agent has to reason about what the human-like agent is trying to achieve by watching a single demonstration.
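To make the Watch stage concrete, the sketch below infers a goal as a set of predicate counts by diffing the first and last states of a single demonstration. This is a minimal illustration, not the paper's actual inference model; the state representation and predicate names are assumptions made for the example.

```python
# Hypothetical Watch-stage sketch: treat the goal as the multiset of
# predicates that newly hold at the end of Alice's demonstration.
# The (predicate, object, location) encoding is illustrative only.
from collections import Counter

def infer_goal(demonstration):
    """demonstration: list of states; each state is a list of
    (predicate, object, location) tuples, e.g. ('on', 'plate', 'table')."""
    initial = Counter(demonstration[0])
    final = Counter(demonstration[-1])
    # Predicates that appear (or appear more often) in the final state
    # than in the initial state are taken as the goal Alice pursued.
    return final - initial

demo = [
    [('inside', 'plate', 'cabinet'), ('inside', 'glass', 'cabinet')],
    [('on', 'plate', 'table'), ('inside', 'glass', 'cabinet')],
    [('on', 'plate', 'table'), ('on', 'glass', 'table')],
]
goal = infer_goal(demo)
# e.g. goal counts ('on', 'plate', 'table') and ('on', 'glass', 'table') once each
```

In the Help stage, Bob would then plan against these inferred predicates in a new environment, where the objects may start in different locations.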
Furthermore, the AI agent has to generalize its acquired knowledge about the goal to a new environment in the Help stage.

To enable multi-agent interactions in realistic environments, we extend an open-source virtual platform, VirtualHome (Puig et al., 2018), and build a multi-agent virtual environment, VirtualHome-Social. VirtualHome-Social simulates realistic and rich home environments where agents can interact with different objects (e.g., by opening a container or grabbing an object) and with other agents (e.g., following, helping, avoiding collisions) to perform complex tasks. VirtualHome-Social also provides i) built-in agents that emulate human behaviors, allowing training and testing of AI agents alongside virtual humans, and ii) an interface for human players, allowing evaluation with real humans and collecting/displaying human activities in realistic environments (a functionality key to machine social intelligence tasks but not offered by existing multi-agent platforms). We plan to open source our environment.

We design an evaluation protocol and provide a benchmark for the challenge, including a goal inference model for the Watch stage, and multiple planning and deep reinforcement learning (DRL) baselines for the Help stage. Experimental results indicate that to achieve success in the proposed challenge, AI agents must acquire strong social perception and generalizable helping strategies. These fundamental aspects of machine social intelligence have been shown to be key to human-AI collaboration in prior work (Grosz & Kraus, 1996; Albrecht & Stone, 2018). In this work, we demonstrate how we can systematically evaluate them in more realistic settings at scale.
The main contributions of our work are: i) a new social intelligence challenge, Watch-And-Help, for evaluating AI agents' social perception and their ability to collaborate with other agents, ii) a multi-agent platform allowing AI agents to perform complex household tasks by interacting with objects and with built-in agents or real humans, and iii) a benchmark consisting of multiple planning- and learning-based approaches which highlights important aspects of machine social intelligence.

2. RELATED WORK

Human activity understanding. An important part of the challenge is to understand human activities. Prior work on activity recognition has mostly focused on recognizing short actions (Sigurdsson et al., 2018; Caba Heilbron et al., 2015; Fouhey et al., 2018), predicting pedestrian trajectories (Kitani et al., 2012; Alahi et al., 2016), recognizing group activities (Shu et al., 2015; Choi & Savarese, 2013; Ibrahim et al., 2016), and recognizing plans (Kautz, 1991; Ramírez & Geffner, 2009). We are interested in the kinds of activity understanding that require inferring other people's mental states (e.g., intentions, desires, beliefs) from observing their behaviors. Therefore, the Watch stage of our challenge focuses instead on understanding humans' goals over a long sequence of actions. This is closely related to work on computational Theory of Mind that aims at inferring



Code and documentation for the VirtualHome-Social environment are available at https://virtual-home.org. Code and data for the WAH challenge are available at https://github.com/xavierpuigf/watch_and_help. A supplementary video can be viewed at https://youtu.be/lrB4K2i8xPI.



Figure 1: Overview of the Watch-And-Help challenge. The challenge has two stages: i) in the Watch stage, Bob will watch a single demonstration of Alice performing a task and infer her goal; ii) then in the Help stage, based on the inferred goal, Bob will work with Alice to help finish the same task as fast as possible in a different environment.

