UNDERSTANDING EMBODIED REFERENCE WITH TOUCH-LINE TRANSFORMER

Abstract

We study embodied reference understanding, the task of locating referents using embodied gestural signals and language references. Human studies have revealed that, contrary to popular belief, objects referred to or pointed at do not lie on the elbow-wrist line but rather on the so-called virtual touch line. Nevertheless, contemporary human pose representations lack the virtual touch line. To tackle this problem, we devise the Touch-Line Transformer: It takes tokenized visual and textual features as inputs and simultaneously predicts the referent's bounding box and a touch-line vector. Leveraging this touch-line prior, we further devise a geometric consistency loss that promotes co-linearity between referents and touch lines. Using the touch line as gestural information dramatically improves model performance: Experiments on the YouRefIt dataset demonstrate that our method yields a +25.0% accuracy improvement under the 0.75 IoU criterion, closing 63.6% of the performance gap between models and humans. Furthermore, we computationally validate prior human studies by demonstrating that computational models locate referents more accurately when employing the virtual touch line than when using the elbow-wrist line.

1. INTRODUCTION

Figure 1: To accurately locate the referent in complex scenes, both nonverbal and verbal expressions are vital. Without the nonverbal expression (here, the pointing gesture), the verbal expression ("the chair") cannot uniquely refer to the intended chair because multiple chairs are present. Conversely, with only the nonverbal expression, one cannot distinguish the intended referent "the chair" from other nearby objects.

Understanding human intent is essential when intelligent robots interact with humans. Nevertheless, most prior work in the modern learning community disregards the multi-modal facet of human-robot communication. Consider the scenario depicted in Figure 1, wherein a person instructs the robot to interact with a chair behind the table. In response, the robot must comprehend what the human is referring to before taking action (e.g., approaching the object and cleaning it). Notably, both embodied gestural signals and language references play significant roles. Without the pointing gesture, the robot could not distinguish between the two chairs using the utterance "the chair that is occluded." Likewise, without the language expression, the robot could not differentiate the chair from other objects in that vicinity (e.g., bags on the table). To address this deficiency, we investigate the embodied reference understanding (ERU) task introduced by Chen et al. (2021). This task requires an algorithm to detect the referent (the referred object) using (i) an image/video containing nonverbal communication signals and (ii) a sentence serving as a verbal communication signal.

The first fundamental challenge in tackling ERU is the representation of human pose. The de facto pose representation in modern computer vision is defined by COCO (Lin et al., 2014): a graph consisting of 17 nodes (keypoints) and 14 edges (keypoint connectivities). Existing models for ERU (Chen et al., 2021) assume pre-extracted COCO-style pose features as algorithm inputs.
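As a concrete illustration of this representation, the sketch below extracts the elbow-wrist line from COCO-ordered keypoints. The keypoint ordering follows the standard COCO annotation format; the function name and array layout are our own illustrative choices, not part of any published API.

```python
import numpy as np

# COCO's 17-keypoint layout (indices per the COCO keypoint annotation format).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def elbow_wrist_ray(keypoints: np.ndarray, side: str = "right"):
    """Return (origin, unit direction) of the elbow-wrist line for one arm.

    keypoints: (17, 2) array of (x, y) image coordinates in COCO order.
    The ray starts at the elbow and points through the wrist.
    """
    elbow = keypoints[COCO_KEYPOINTS.index(f"{side}_elbow")]
    wrist = keypoints[COCO_KEYPOINTS.index(f"{side}_wrist")]
    direction = wrist - elbow
    return elbow, direction / np.linalg.norm(direction)
```

Note that the COCO graph contains no eye-to-fingertip edge; this absence is precisely the limitation discussed next.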
However, we rethink the limitations of the COCO-style pose graph in the context of ERU and uncover a counter-intuitive fact: The referent does not lie on the elbow-wrist line (i.e., the line that links the elbow and the wrist). As shown in Figure 2, this line (in red) does not cross the referred microwave, exhibiting a typical misinterpretation of human pointing (Herbort & Kunde, 2018). A recent developmental study (O'Madagain et al., 2019) presents compelling evidence supporting this observation. It studies how humans mentally develop pointing gestures and argues that pointing is a virtual form of reaching out to touch. This finding challenges conventional psychological views (McGinn, 1981; Kita, 2003) that pointing is mentally a behavior of using the limb as an arrow. Inheriting the terminology of O'Madagain et al. (2019), we term the red line in Figure 2 the elbow-wrist line (EWL) and the green line (which connects the eye and the fingertip) the virtual touch line (VTL). Inspired by the observation that VTLs are more accurate than EWLs in embodied reference, we augment the existing COCO-style pose graph with an edge that connects the eye and the fingertip. As validated by a series of experiments in Section 4, this augmentation significantly improves performance on YouRefIt.

The second fundamental challenge in tackling ERU is how to jointly model gestural signals and language references. Inspired by the success of multi-modal Transformers (Chen et al., 2020; Li et al., 2020; Tan & Bansal, 2019; Lu et al., 2019; Kamath et al., 2021) on multi-modal tasks (Hudson & Manning, 2019; Antol et al., 2015; Zellers et al., 2019), we devise the Touch-Line Transformer. It takes both visual and natural-language modalities as inputs and jointly models gestural signals and language references by simultaneously predicting the touch-line vector and the referent's bounding box.
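The VTL augmentation can be sketched as follows. COCO's 17 keypoints include the eyes but no fingertip, so the fingertip coordinate here is assumed to come from an additional hand keypoint detector; both helper names are ours, and the distance function is only a simple geometric proxy for the co-linearity prior, not the paper's learned model.

```python
import numpy as np

def virtual_touch_line(eye: np.ndarray, fingertip: np.ndarray):
    """Return (origin, unit direction) of the VTL: the ray from the eye
    through the fingertip, i.e., the augmented eye-fingertip edge.

    eye, fingertip: (2,) arrays of (x, y) image coordinates. The fingertip
    is assumed to be supplied by a separate hand keypoint detector, since
    the standard COCO skeleton does not include it.
    """
    direction = fingertip - eye
    return eye, direction / np.linalg.norm(direction)

def point_to_ray_distance(point, origin, unit_dir):
    """Perpendicular distance from a candidate referent center to the ray,
    computed via the 2D cross product |v x d| with unit direction d."""
    v = point - origin
    return float(abs(v[0] * unit_dir[1] - v[1] * unit_dir[0]))
```

Under this prior, objects closer to the VTL ray are more plausible referents than objects closer to the EWL ray.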
To further help our model utilize gestural signals (i.e., the touch-line vector), we introduce a geometric consistency loss that encourages co-linearity between the touch line and the predicted referent's location, yielding significant performance improvements. Leveraging the above two insights, our method achieves a +25.0% accuracy gain under the 0.75 IoU criterion on the YouRefIt dataset compared to state-of-the-art methods, closing 63.6% of the gap between model and human performance.

This paper makes four contributions: (i) a novel computational pose representation, the VTL; (ii) the Touch-Line Transformer, which jointly models nonverbal gestural signals and verbal references; (iii) a geometric consistency loss that improves the co-linearity between the touch line and the predicted object; and (iv) a new state-of-the-art performance on ERU, with a +25.0% accuracy gain under the 0.75 IoU criterion.
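One plausible form of such a geometric consistency loss is sketched below purely for illustration: it penalizes the angle between the touch-line direction and the vector from the eye to the predicted box center via 1 minus their cosine similarity. This is our own formulation under stated assumptions, not necessarily the paper's exact loss.

```python
import numpy as np

def colinearity_loss(eye, fingertip, box_center):
    """Illustrative geometric consistency loss (1 - cosine similarity).

    eye, fingertip, box_center: (B, 2) arrays of (x, y) coordinates.
    The loss is 0 when each predicted box center lies exactly on the ray
    from the eye through the fingertip, and grows as the predicted
    referent drifts off the touch line.
    """
    touch_dir = fingertip - eye       # predicted touch-line vector
    to_object = box_center - eye      # eye-to-referent vector
    cos = np.sum(touch_dir * to_object, axis=-1) / (
        np.linalg.norm(touch_dir, axis=-1) * np.linalg.norm(to_object, axis=-1)
    )
    return float(np.mean(1.0 - cos))
```

In training, a term like this would be added to the box-regression objective so that gradient descent pulls predicted referents toward the touch line.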

2. RELATED WORK

Misinterpretation of pointing gestures Pointing enables observers and pointers to direct visual attention and establish references in communication. Recent research reveals, surprisingly, that observers make systematic errors (Herbort & Kunde, 2016): While pointers produce gestures using VTLs, observers interpret pointing gestures using the "arm-finger" line. O'Madagain et al. (2019) formulated the VTL mechanism: Pointing gestures orient toward their targets as if the pointers were to touch them. In neuroscience, gaze effects occur for tasks that require gaze alignment with finger pointing (Bédard et al., 2008). The preceding evidence demonstrates that eye position and gaze direction are crucial for understanding pointing. Critically, Herbort & Kunde (2018) verify that directing human observers to extrapolate the touch-line vector reduces systematic misinterpretation during human-human communication. Inspired by these discoveries, we incorporate the touch-line vector to enhance pointing gesture interpretation.


Figure 2: Virtual touch line (VTL, in green) vs. elbow-wrist line (EWL, in red). VTLs afford a more accurate location of referents than EWLs.

