UNDERSTANDING EMBODIED REFERENCE WITH TOUCH-LINE TRANSFORMER

Abstract

We study embodied reference understanding: the task of locating referents using embodied gestural signals and language references. Human studies have revealed that, contrary to popular belief, objects referred to or pointed at do not lie on the elbow-wrist line but rather on the so-called virtual touch line. Nevertheless, contemporary human pose representations lack the virtual touch line. To tackle this problem, we devise the Touch-Line Transformer: it takes tokenized visual and textual features as input and simultaneously predicts the referent's bounding box and a touch-line vector. Leveraging this touch-line prior, we further devise a geometric consistency loss that promotes co-linearity between referents and touch lines. Using the touch line as gestural information dramatically improves model performance: experiments on the YouRefIt dataset demonstrate that our method yields a +25.0% accuracy improvement under the 0.75 IoU criterion, closing 63.6% of the performance gap between models and humans. Furthermore, we computationally validate prior human studies by demonstrating that computational models locate referents more accurately when employing the virtual touch line than when using the elbow-wrist line.
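The co-linearity objective described above can be sketched in a few lines. The sketch below is a hypothetical simplification, not the paper's implementation: it assumes the touch line is parameterized by 2-D eye and fingertip keypoints, and penalizes the angle between the touch-line direction and the eye-to-referent direction via cosine similarity. The function name and signature are illustrative only.

```python
import math

def colinearity_loss(eye, fingertip, box_center):
    """Hypothetical sketch of a geometric consistency loss.

    Penalizes deviation of the predicted referent's center from the
    virtual touch line (eye -> fingertip) using 1 - cos(angle) between
    the two direction vectors; 0 when perfectly co-linear.
    """
    # Direction of the touch line: eye -> fingertip
    tx, ty = fingertip[0] - eye[0], fingertip[1] - eye[1]
    # Direction from the eye to the predicted referent center
    rx, ry = box_center[0] - eye[0], box_center[1] - eye[1]
    eps = 1e-8  # guard against zero-length vectors
    cos = (tx * rx + ty * ry) / (math.hypot(tx, ty) * math.hypot(rx, ry) + eps)
    return 1.0 - cos
```

For example, with the eye at (0, 0), the fingertip at (1, 1), and a referent centered at (2, 2), the loss is essentially zero; a referent orthogonal to the touch line yields a loss of 1. In practice such a term would be averaged over a batch and weighted against the box-regression losses.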

1. INTRODUCTION

Figure 1: To accurately locate the referent in complex scenes, both nonverbal and verbal expressions are vital. Without the nonverbal expression (in this case, the pointing gesture), the verbal expression ("the chair") cannot uniquely refer to the intended chair because multiple chairs are present. Conversely, with only nonverbal expressions, one cannot distinguish the intended referent "the chair" from other nearby objects.

Understanding human intent is essential when intelligent robots interact with humans. Nevertheless, most prior work in the modern learning community disregards the multi-modal facet of human-robot communication. Consider the scenario depicted in Figure 1, wherein a person instructs the robot to interact with a chair behind the table. In response, the robot must comprehend what the human is referring to before taking action (e.g., approaching the object and cleaning it). Notably, both embodied gestural signals and language references play significant roles: without the pointing gesture, the robot could not distinguish between the two chairs given the utterance "the chair that is occluded"; likewise, without the language expression, the robot could not differentiate the chair from other objects in its vicinity (e.g., the bags on the table). To address this deficiency, we investigate the embodied reference understanding (ERU) task introduced by Chen et al. (2021). This task requires an algorithm to detect the referent (the referred object) using (i) an image/video containing nonverbal communication signals and (ii) a sentence serving as the verbal communication signal.

