WHAT CAN YOU LEARN FROM YOUR MUSCLES? LEARNING VISUAL REPRESENTATION FROM HUMAN INTERACTIONS

Abstract

Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach: we use human interaction and attention cues to investigate whether we can learn better representations than those learned from visual data alone. For this study, we collect a dataset of human interactions capturing body part movements and gaze in daily life. Our experiments show that our "muscly-supervised" representation, which encodes interaction and attention cues, outperforms a visual-only state-of-the-art method, MoCo (He et al., 2020), on a variety of target tasks: scene classification (semantic), action recognition (temporal), depth estimation (geometric), dynamics prediction (physics), and walkable surface estimation (affordance). Our code and dataset are available at: https://github.com/ehsanik/muscleTorch.



Figure 1: We propose to use humans' interactions with their visual surroundings as a training signal for representation learning. We record first-person observations as well as the movements and gaze of people going about their daily routines, and we use these cues to learn a visual embedding. We apply the learned representation to a variety of diverse tasks and show consistent improvements over state-of-the-art self-supervised vision-only techniques.

1. INTRODUCTION

Encoding visual information from pixel space into a lower-dimensional vector is the core element of most modern deep learning-based solutions to computer vision. A rich set of algorithms and architectures has been developed to learn these encodings. A common practice in computer vision is to explicitly train networks to map visual inputs to a curated label space. For example, a neural network is pre-trained on a large-scale annotated classification dataset (Deng et al., 2009; Krasin et al., 2017), and the entire network, or part of it, is fine-tuned on a new target task (Goyal et al., 2019; Zamir et al., 2018). In recent years, weakly supervised and self-supervised representation learning approaches (e.g., Mahajan et al. (2018); He et al. (2020); Chen et al. (2020a)) have been proposed to mitigate the need for supervision. The most successful are contrastive learning-based approaches such as (Chen et al., 2020c;b), which have shown remarkable results on target tasks such as image classification and object detection. Despite their success, there are two primary caveats: (1) these self-supervised methods are still trained on ImageNet or similar datasets, which are fairly cleaned up and/or include a pre-specified set of object categories; and (2) this style of training is passive, in that it does not encode interactions. In contrast, for humans, a vast majority of our visual understanding is shaped by our interactions and our observations of others interacting with their environments. We are not limited to learning from visual cues alone; various other supervisory signals, such as body movements and attention cues, are available to us. By learning how to move their joints to walk and crawl, infants significantly enhance their perception and cognition (Adolph & Robinson, 2015).
Moreover, by observing another person interact with the environment, humans obtain a visual and physical perception of the world (Bandura, 1977). The question we investigate in this paper is: "Can we learn a rich, generalizable visual representation by encoding human interactions into our visual features?" In this work, we consider the movement of human body parts and the center of attention (gaze) as indicators of interaction with the environment, and we propose an approach for incorporating this interaction information into a muscly-supervised representation learning process. To study what we can learn from interaction, we attach sensors to participants' limbs and observe how they react to visual events in their daily lives. More specifically, we record the movements of body parts with Inertial Measurement Units (IMUs) and track gaze to monitor the center of attention. We introduce a new dataset of more than 4,500 minutes of interaction by 35 participants engaging in everyday scenarios, with the corresponding body part movements and centers of attention. There are no constraints on the actions, and no manual annotations or labels are provided. Our experiments show that the representation we learn by predicting gaze and body movements in addition to the visual cues outperforms the visual-only baseline on a diverse set of target tasks (Figure 1): semantic (scene classification), temporal (action recognition), geometric (depth estimation), physics (dynamics prediction), and affordance-based (walkable surface estimation). This shows that movement and gaze information help learn a more informative representation than a visual-only model.
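The idea of predicting gaze and body-part movement from visual features can be sketched as a shared encoder with auxiliary prediction heads. The sketch below is illustrative only: the class name, head structure, feature dimension, and the small convolutional backbone are all hypothetical placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MusclySupervisedSketch(nn.Module):
    """Hypothetical sketch: a shared visual encoder whose features are
    trained to predict body-part movement classes (from IMU signals)
    and the 2D gaze location, alongside the visual objective."""

    def __init__(self, feat_dim=512, num_body_parts=10, move_classes=3):
        super().__init__()
        # Small stand-in visual backbone (the real model would use a CNN
        # such as a ResNet; this choice is an assumption for brevity).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Head predicting a movement class per body part.
        self.movement_head = nn.Linear(feat_dim, num_body_parts * move_classes)
        # Head regressing the 2D center of attention (gaze).
        self.gaze_head = nn.Linear(feat_dim, 2)

    def forward(self, frames):
        z = self.encoder(frames)            # shared representation
        move_logits = self.movement_head(z) # supervision from IMU readings
        gaze_xy = self.gaze_head(z)         # supervision from gaze tracker
        return z, move_logits, gaze_xy
```

During pre-training, the movement and gaze heads would be supervised by the recorded IMU and gaze data; for a downstream task, the heads are discarded and only the encoder features `z` are reused.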

2. RELATED WORK

Visual representations can be learned using many different techniques, from full supervision to no supervision at all. We outline the most common paradigms of representation learning, namely supervised, self-supervised, and interaction-based representation learning.

Supervised Representation Learning. Supervised representation learning in computer vision is typically performed by pre-training neural networks on large-scale datasets with full supervision (e.g., ImageNet (Deng et al., 2009)) or weak supervision (e.g., Instagram-1B (Mahajan et al., 2018)). These models are fine-tuned for a variety of tasks including object detection (Girshick et al., 2014; Ren et al., 2015), semantic segmentation (Shelhamer et al., 2015; Chen et al., 2017), and visual question answering (Agrawal et al., 2015a; Hudson & Manning, 2019). However, collecting a manually annotated large-scale dataset such as ImageNet requires extensive resources in terms of cost and time. In contrast, in this paper, we only use human interaction data, which does not require any manual annotation.

Self-supervised Representation Learning. There has been a wide range of research on self-supervised learning of visual representations, in which properties of the images themselves act as supervision. The objectives for these methods cover a variety of tasks such as solving jigsaw puzzles (Noroozi & Favaro, 2016), colorizing grayscale images (Zhang et al., 2016), learning to count (Noroozi et al., 2017), predicting context (Doersch et al., 2015), inpainting (Pathak et al., 2016), adversarial training (Donahue et al., 2017), and predicting image rotations (Gidaris et al., 2018). This type of representation learning is not limited to learning from single frames. Inspired by contrastive learning (Hadsell et al., 2006), recent methods have used "instance discrimination", in which the network uniquely identifies each image.
A network is trained to produce a non-linear mapping that projects multiple variations of an image closer to each other than to all other images in the dataset.
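The instance-discrimination objective described above can be written as an InfoNCE-style loss. The sketch below is a simplified illustration, not MoCo's actual implementation (which additionally maintains a momentum-updated key encoder and a large queue of negatives); the function name and the queue argument are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(q, k, queue, temperature=0.07):
    """Simplified InfoNCE loss: two augmented views of the same image
    (query q, key k) should be more similar to each other than q is to
    any of the negative features stored in `queue`."""
    q = F.normalize(q, dim=1)                  # (N, D) query features
    k = F.normalize(k, dim=1)                  # (N, D) positive key features
    queue = F.normalize(queue, dim=1)          # (K, D) negatives (other images)
    l_pos = (q * k).sum(dim=1, keepdim=True)   # (N, 1) positive similarities
    l_neg = q @ queue.t()                      # (N, K) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive pair sits at index 0 of each row of logits.
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

Minimizing this cross-entropy pulls the two views of each image together while pushing them away from every other instance, which is the "uniquely identify each image" behavior described above.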



Interaction-based Representation Learning. Agrawal et al. (2015b) and Jayaraman & Grauman (2015) both use egomotion, Wang & Gupta (2015) cyclically track patches in videos, Pathak et al. (2017) use low-level, non-semantic motion-based cues, and Vondrick et al. (2016) predict the representation of future frames.


