LEARNING REWARDS AND SKILLS TO FOLLOW COMMANDS WITH A DATA-EFFICIENT VISUAL-AUDIO REPRESENTATION

Anonymous

Abstract

Based on recent advancements in representation learning, we propose a novel framework for command-following robots with raw sensor inputs. Previous RL-based methods are either difficult to continuously improve after deployment or require a large number of new labels during fine-tuning. Motivated by the (self-)supervised contrastive learning literature, we propose a novel representation, named VAR++, that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts, and the robot can fulfill sound commands without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated experiments, we show that our system can continually self-improve in previously unseen scenarios with less newly labeled data, while achieving better performance than previous methods.

1. INTRODUCTION

When humans are told to turn on a TV, they can associate what they hear with what they see, even in unfamiliar environments. For robots to follow commands and fulfill similar tasks, they must ground task-oriented language to vision and motor skills. Command following is an important capability that paves the way for non-experts to intuitively communicate and collaborate with robots in daily life. The need for command-following robots has spurred a wealth of research. Learning-based language grounding agents have been proposed to perform tasks according to visual observations and text/speech instructions Anderson et al. (2018); Chang et al. (2020); Chaplot et al. (2018); Hermann et al. (2017); Shridhar et al. (2020); Yu et al. (2018). However, these approaches fail to completely solve a common problem in learning-based methods: performance degradation in a novel target domain Akkaya et al. (2019); James et al. (2019); Tobin et al. (2017). One solution to the domain shift problem is domain randomization Tobin et al. (2017). However, domain randomization alone has been shown to be insufficient, since the randomized simulation may not accurately reflect the target domain that the robot is later deployed in Du et al. (2021); Smith et al. (2022). Alternatively, fine-tuning policies in the target domain can further reduce the reality gap, but is often cost-prohibitive: professionals usually train the robots with hand-crafted, task-specific reward functions Haarnoja et al. (2019); Smith et al. (2022) and large amounts of labels, neither of which can be afforded by non-expert users after deployment. Without enough domain expertise or abundant labeled data, how can we allow users to adapt such robots to novel domains with minimal supervision? Prior work has partially answered this question by proposing a visual-audio representation (VAR) trained with a triplet loss, which associates audio commands and goal images that share the same intent Chang et al. (2021). However, due to the true negative pairs used in the triplet loss, the number of labels required to fine-tune the VAR is still unsatisfactory, hindering efficient deployment in the target domain. In this paper, we propose a novel framework that builds on (self-)supervised contrastive learning to realize more effective training and more efficient fine-tuning for reward and skill learning.
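To make the two ingredients discussed above concrete, the following is a minimal sketch of (a) a triplet loss that pulls an audio-command embedding toward a goal-image embedding of the same intent and pushes it away from a mismatched image, and (b) an intrinsic reward computed as the similarity between the command embedding and the current observation embedding. The function names and the use of plain numpy vectors are illustrative assumptions, not the paper's actual architecture; in practice the embeddings would come from learned audio and image encoders.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss over embedding vectors.

    anchor:   embedding of an audio command
    positive: embedding of a goal image with the same intent
    negative: embedding of an image with a different intent
    Penalizes cases where the positive is not closer to the
    anchor than the negative by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def intrinsic_reward(audio_emb, image_emb):
    """Cosine similarity between the command embedding and the
    current observation embedding, usable as a dense reward
    signal in place of a hand-crafted reward function."""
    denom = np.linalg.norm(audio_emb) * np.linalg.norm(image_emb) + 1e-8
    return float(np.dot(audio_emb, image_emb) / denom)

# Toy example: the command embedding matches the goal image exactly.
cmd = np.array([1.0, 0.0])
goal = np.array([1.0, 0.0])     # same intent
other = np.array([0.0, 1.0])    # different intent
loss = triplet_loss(cmd, goal, other)      # 0.0: margin satisfied
reward = intrinsic_reward(cmd, goal)       # ~1.0: observation matches command
```

Because the loss is driven by labeled positive/negative pairs, reducing how many such pairs are needed is exactly the fine-tuning bottleneck the paper targets.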

