LEARNING REWARDS AND SKILLS TO FOLLOW COMMANDS WITH A DATA-EFFICIENT VISUAL-AUDIO REPRESENTATION

Anonymous

Abstract

Based on recent advances in representation learning, we propose a novel framework for command-following robots with raw sensor inputs. Previous RL-based methods are either difficult to improve continuously after deployment or require a large number of new labels during fine-tuning. Motivated by the (self-)supervised contrastive learning literature, we propose a novel representation, named VAR++, that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts, and the robot can fulfill sound commands without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated experiments, we show that our system can continually self-improve in previously unseen scenarios with fewer new labeled data, yet achieves better performance than previous methods.

1. INTRODUCTION

When humans are told to turn on a TV, they can associate what they hear with what they see even in unfamiliar environments. For robots to follow commands and fulfill similar tasks, they must ground task-oriented language in vision and motor skills. Command following is an important capability that paves the way for non-experts to communicate and collaborate with robots intuitively in daily life. The need for command-following robots has spurred a wealth of research: learning-based language grounding agents have been proposed to perform tasks according to visual observations and text/speech instructions Anderson et al. (2018). Without enough domain expertise or abundant labeled data, how can we allow users to adapt such robots to novel domains with minimal supervision? Prior work has partially answered this question by proposing a visual-audio representation (VAR) trained with a triplet loss, which associates audio commands and goal images of the same intent Chang et al. (2021). However, because the triplet loss relies on explicitly labeled negative pairs, the number of labels required to fine-tune the VAR remains unsatisfactory, hindering efficient deployment in the target domain. In this paper, we propose a novel framework that builds on (self-)supervised contrastive learning to realize more effective training and more efficient fine-tuning of rewards and skills. In the first stage, we train the representation, VAR++, with a (self-)supervised contrastive loss that associates sound commands and goal images of the same intent. In the second stage, we use VAR++ to compute intrinsic reward functions to learn various robot skills with RL without any reward engineering. When the robot is deployed in a new domain, such as a different room, the fine-tuning stage is data efficient in terms of label usage and natural to non-experts in terms of human-robot interaction. For example, a user can teach the robot (and thus VAR++) by saying "this is an apple" when the robot sees an apple. RL policies are then self-improved with the updated VAR++.
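The excerpt does not give the training objective explicitly; as a rough illustration, a supervised-contrastive-style (InfoNCE-like) objective over paired image and audio embeddings could be sketched as below. All names (`supervised_contrastive_loss`, the temperature value, the averaging over positives) are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def supervised_contrastive_loss(img_emb, aud_emb, labels, temperature=0.1):
    """InfoNCE-style sketch: each image is pulled toward every audio clip
    sharing its intent label and pushed from the rest of the batch, so no
    explicit negative pairs need to be labeled (illustrative assumption)."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    aud = aud_emb / np.linalg.norm(aud_emb, axis=1, keepdims=True)
    sim = img @ aud.T / temperature              # (N, N) similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    # Log-softmax over the audio candidates for each image.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    pos_mask = labels[:, None] == labels[None, :]  # same-intent pairs are positives
    # Negative mean log-probability of the positives for each image.
    loss = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return loss.mean()
```

With such a batch-wise objective, every other intent in the batch serves as an implicit negative, which is one way the label burden of curated triplets can be avoided.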
No hand-designed rewards or negative pairs are needed, unlike in previous works. We apply this learning approach to different robotic tasks in diverse settings, as illustrated in Fig. 1 and Fig. 3. Given a sound command, the robot must identify the commander's goal (intent), draw the correspondence between the raw visual and audio inputs, and develop a policy to finish the task. The tasks are challenging because no maps, depth images, human demonstrations, or prior knowledge are available, and the observation comes mainly from a monocular, uncalibrated RGB camera. Our main contributions are as follows: (1) We propose a novel representation of visual-audio observations for command-following robots, named VAR++. To the best of our knowledge, this is the first work demonstrating that a (self-)supervised contrastive loss improves robot control performance over a triplet loss. (2) We propose a data-efficient fine-tuning method for command-following robots that requires significantly fewer labels than baselines. Moreover, our fine-tuning method demonstrates that (self-)supervised contrastive learning has the potential to enhance user experiences, especially for non-experts. (3) We release our simulation environments and model implementations. The simulations are the first open-sourced AI environments that use real speech recordings for robotic command-following tasks. The code will be released after the review at www.github.com
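To make the "intrinsic reward without reward engineering" idea concrete, one plausible form (our assumption; the paper's exact reward may differ) is the similarity between the current observation's embedding and the command's embedding in the shared representation space. The encoder stand-ins in the usage comment are hypothetical:

```python
import numpy as np

def intrinsic_reward(image_emb, command_emb):
    """Cosine similarity between the embedding of the current camera frame
    and the embedding of the sound command: high when the robot sees what
    was asked for. A hedged sketch, not the paper's confirmed reward."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    command_emb = command_emb / np.linalg.norm(command_emb)
    return float(image_emb @ command_emb)

# Hypothetical use in an RL loop: encode each frame and the fixed command
# with the (frozen) representation encoders, then reward the agent when the
# visual embedding approaches the command's embedding.
# r_t = intrinsic_reward(encode_image(obs_t), encode_audio(command))
```

Because the reward is derived entirely from the learned representation, updating the representation after deployment (e.g., the "this is an apple" interaction) automatically updates the reward the policy is trained against.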

2. RELATED WORK

End-to-end language understanding. End-to-end spoken language understanding (SLU) systems extract a speaker's intent directly from raw speech signals without first translating the speech to text Kim et al. (2021); Lugosch et al. (2019); Serdyuk et al. (2018). Such end-to-end systems can fully exploit subtle information, such as speaker emotion, that is lost during speech-to-text transcription Kim et al. (2021); Lugosch et al. (2019). However, end-to-end SLU systems are mainly developed for virtual digital assistants rather than robotic applications.

Language grounding agents. Conventional language grounding agents consist of independent modules for transcription, language grounding, and planning Magassouba et al. (2019); Paul et al. (2018); Stramandinoli et al. (2016). These modular pipelines suffer from intermediate errors and do not generalize beyond their programmed domains Hermann et al. (2017); Tada et al. (2020); Vanzo et al. (2016). To address these problems, end-to-end language grounding agents are used to perform tasks according to text-based natural language instructions and visual observations Anderson et al. (2018); Chang et al. (2020); Chaplot et al. (2018); Hermann et al. (2017); Shridhar et al. (2020); Yu et al. (2018). However, these approaches fail to completely solve a common problem of learning-based methods: performance degradation in a novel target domain Akkaya et al. (2019); James et al. (2019); Tobin et al. (2017). One solution to the domain shift problem is domain randomization Tobin et al. (2017). However, domain randomization alone has been shown to be insufficient, since the randomized simulation may not accurately reflect the target domain in which the robot is later deployed Du et al. (2021); Smith et al. (2022). Alternatively, fine-tuning policies in the target domain can further reduce the reality gap, but this is often cost prohibitive: professionals usually train the robots with hand-crafted, task-specific reward functions Haarnoja et al. (2019); Smith et al. (2022) and large amounts of labels, neither of which non-expert users can afford after deployment.

Figure 1: Illustration of our pipeline. Contrastive learning is used to group images and audio commands of the same intent. The resulting representation, VAR++, supports the downstream RL training by encoding the high-dimensional voice and image signals and providing reward signals and states to the agent.

