LEARNING REWARDS AND SKILLS TO FOLLOW COMMANDS WITH A DATA EFFICIENT VISUAL-AUDIO REPRESENTATION

Anonymous

Abstract

Based on recent advances in representation learning, we propose a novel framework for command-following robots with raw sensor inputs. Previous RL-based methods are either difficult to improve continually after deployment or require a large number of new labels during fine-tuning. Motivated by the (self-)supervised contrastive learning literature, we propose a novel representation, named VAR++, that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts, and the robot is able to fulfill sound commands without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In the simulated experiments, we show that our system can continually self-improve in previously unseen scenarios with less newly labeled data while achieving better performance than previous methods.

1. INTRODUCTION

When humans are told to turn on a TV, they can associate what they hear with what they see even in unfamiliar environments. For robots to follow commands and fulfill similar tasks, they must ground task-oriented language to vision and motor skills. Command-following robots are an important application that paves the way for non-experts to intuitively communicate and collaborate with robots in daily life. The need for command-following robots has spurred a wealth of research. Learning-based language grounding agents were proposed to perform tasks according to visual observations and text/speech instructions Anderson et al. (2018); Chang et al. (2020); Chaplot et al. (2018); Hermann et al. (2017); Shridhar et al. (2020); Yu et al. (2018). However, these approaches fail to completely solve a common problem in learning-based methods: performance degradation in a novel target domain Akkaya et al. (2019); James et al. (2019); Tobin et al. (2017). One solution to the domain shift problem is domain randomization Tobin et al. (2017). However, it has been shown that domain randomization alone is not sufficient, since the randomized simulation may not accurately reflect the target domain in which the robot is later deployed Du et al. (2021); Smith et al. (2022). Alternatively, fine-tuning policies in the target domain can further reduce the reality gap but is often cost-prohibitive: professionals usually train the robots with hand-crafted, task-specific reward functions Haarnoja et al. (2019); Smith et al. (2022) and large amounts of labels, neither of which can be afforded by non-expert users after deployment. Without enough domain expertise or abundant labeled data, how can we allow users to adapt such robots to novel domains with minimal supervision?
Prior work has partially answered this question by proposing a visual-audio representation (VAR) trained with triplet loss, which associates audio commands and goal images with the same intent Chang et al. (2021). However, because the triplet loss relies on true negative pairs, the number of labels required to fine-tune the VAR is still unsatisfactory, which hinders efficient deployment in the target domain. In this paper, we propose a novel framework that builds on (self-)supervised contrastive learning to realize more effective training and more efficient fine-tuning for reward and skill learning. As shown in Fig. 1, we first learn a joint representation of visual and audio signals (VAR++) whose clusters have better intra-cluster cohesion and inter-cluster separation than those of VAR Chang et al. (2021). In the second stage, we use VAR++ to compute intrinsic reward functions to learn various robot skills with RL without any reward engineering. When the robot is deployed in a new domain such as a different room, the fine-tuning stage is data-efficient in terms of label usage and natural to non-experts in terms of human-robot interaction. For example, a user can teach a robot or VAR++ by saying "this is an apple" when the robot sees an apple. Then, RL policies are self-improved with the updated VAR++. Unlike previous works, no hand-designed rewards or negative pairs are needed. We apply this learning approach to different robotic tasks in diverse settings, as illustrated in Fig. 1 and Fig. 3. Given a sound command, the robot must identify the commander's goal (intent), draw the correspondence between the raw visual and audio inputs, and develop a policy to finish the task. The tasks are challenging because no maps, depth images, human demonstrations, or prior knowledge are available, and the observation mainly comes from a monocular uncalibrated RGB camera.
Our main contributions are as follows: (1) We propose a novel representation of visual-audio observations for command-following robots, named VAR++. To the best of our knowledge, it is the first work demonstrating that (self-)supervised contrastive loss improves robot control performance over triplet loss. (2) We propose a data-efficient fine-tuning method for command-following robots that requires significantly fewer labels than the baselines. Moreover, our fine-tuning method demonstrates that (self-)supervised contrastive loss has the potential to enhance user experiences, especially for non-experts. (3) We release our simulation environments and model implementations. The simulations are the first open-sourced AI environments that use real speech recordings for robotic command-following tasks. The code will be released after the review at www.github.com

2. RELATED WORKS

End-to-end language understanding. End-to-end spoken language understanding (SLU) systems extract the speaker's intent directly from raw speech signals without translating the speech to text Kim et al. (2021); Lugosch et al. (2019); Serdyuk et al. (2018). Such an end-to-end system is able to fully exploit subtle information, such as speaker emotion, that is lost during speech-to-text transcription Kim et al. (2021); Lugosch et al. (2019). However, end-to-end SLU systems are mainly developed for virtual digital assistants and not for robotic applications.

Language grounding agents. Conventional language grounding agents consist of independent modules for transcription, language grounding, and planning Magassouba et al. (2019); Paul et al. (2018); Stramandinoli et al. (2016). These modular pipelines suffer from intermediate errors and do not generalize beyond their programmed domains Hermann et al. (2017); Tada et al. (2020); Vanzo et al. (2016). To address these problems, end-to-end language grounding agents are used to perform tasks according to text-based natural language instructions and visual observations Anderson et al. (2018); Chaplot et al. (2018); Hermann et al. (2017); Shridhar et al. (2020); Yu et al. (2018). Our work differs from these works in two ways. First, while the above works consider text-based input, we focus on commanding agents through raw audio, which leads to more natural human-robot communication without additional modules. Second, training these agents requires either expert demonstrations, step-by-step instructions Anderson et al. (2018); Shridhar et al. (2020), or a carefully designed extrinsic reward function Chaplot et al. (2018); Hermann et al. (2017); Yu et al. (2018). Although some methods generalize to new command sentences and/or new scenes to some extent, they overlook continual fine-tuning after deployment Chang et al. (2020); Yu et al. (2018).
In contrast, our method requires none of the above and thus requires significantly less effort to fine-tune in a novel domain, where both perception and dynamics are different from the training scenes. Another line of work trains the robot to fulfill commands directly from raw audio inputs with RL Chang et al. (2020; 2021). However, the method in Chang et al. (2020) requires hand-tuned reward functions and a prohibitive number of one-hot labels, which makes it hard to fine-tune. Chang et al. (2021) partially addresses the problem by learning a visual-audio representation (VAR) with triplet loss to generate an intrinsic reward function for RL. However, the quality of the VAR in Chang et al. (2021) is suboptimal based on our quantitative evaluation. In contrast, the better representation in our work leads to better reward functions and better robot performance.

Representation learning for robotics. Representation learning has shown great potential in learning useful embeddings for downstream robotic tasks. Deep autoencoders have been used to compress high-dimensional observations such as images into a low-dimensional latent space. The resulting latent vectors are then used as states or intrinsic rewards for RL Lange et al. (2012); Nair et al. (2018); Wang et al. (2020). At test time, however, the methods in Nair et al. (2018); Wang et al. (2020) require users to provide goal images for task execution, while our method takes voice commands, which is a more natural and convenient way of human-robot communication. Additionally, reconstructing the input images often makes autoencoders computationally expensive. Another line of work uses contrastive loss to learn representations for downstream tasks such as grasping and water pouring Jang et al. (2018); Nguyen et al. (2020); Sermanet et al. (2018). Contrastive loss avoids the reconstruction operation in autoencoders.
While all of these works focus mainly on the visual or the text modality, we address the interplay between sight and sound.

3. METHODOLOGY

In this section, we describe the two-stage training pipeline and the fine-tuning procedure. In training, we assume the availability of sufficiently large labeled datasets and simulators. In fine-tuning, however, speech transcriptions, one-hot labels, and reward functions are not available.

3.1. VISUAL-AUDIO REPRESENTATION LEARNING

In the first stage, we collect visual-audio pairs from the environment. Then, we learn a joint representation of images and audio, named VAR++, that associates an image with its corresponding sound command.

Data collection. Suppose there are $M$ possible intents or tasks within an environment. We collect visual-audio pairs defined as $(I, S, y)$ from the environment, where $I \in \mathbb{R}^{n \times n}$ is the current RGB image from the robot's camera, $S \in \mathbb{R}^{l \times m}$ is the Mel Frequency Cepstral Coefficients (MFCC) Davis & Mermelstein (1980) of the sound command, and $y \in \{0, 1, ..., M\}$ is the intent ID. We call $I$ and $S$ two views of an intent $y$. A visual-audio pair contains an image and a sound command of the same intent. For example, when an iTHOR agent sees a lit lamp, it hears the sound "Switch on the lamp" from the environment. In contrast, when the agent sees no object, or is so far away from all objects that it sees many objects at once, it receives only an image and hears no sound. The image is then paired with $S = \mathbf{0}_{l \times m}$ and $y = M$. We define this situation as an empty intent.

Training VAR++. Our goal is to encode both visual and auditory signals into a joint latent space, where the embeddings from the same intent are pulled closer together than embeddings from different intents. For example, the embedding of an image with a TV turned on needs to be close to the embedding of the sound command "Turn on the TV" but far away from irrelevant commands such as "Turn off the light." We adopt the idea of (self-)supervised contrastive learning for visual representations and formulate the problem as metric learning. As shown in Fig. 2a, the VAR++ is a double-branch network with two main components. The first component consists of the encoders $f_I$ and $f_S$, which produce the representations $h^I \in \mathbb{R}^{d_I}$ and $h^S \in \mathbb{R}^{d_S}$ of the image and the sound. The second component consists of the heads $g_I : \mathbb{R}^{d_I} \to \mathbb{R}^{d}$, $g_S : \mathbb{R}^{d_S} \to \mathbb{R}^{d}$, $b_I : \mathbb{R}^{d_I} \to \mathbb{R}$, and $b_S : \mathbb{R}^{d_S} \to \mathbb{R}$ that map the representations $h^I$ and $h^S$ to the space where the losses are applied. We denote the vector embeddings $g_I(h^I)$ and $g_S(h^S)$ as $z^I$ and $z^S$, respectively.
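The data collection convention above, including the empty intent, can be sketched in a few lines of Python. This is only an illustration: the function name `make_pair` and its arguments are hypothetical, and images and MFCC matrices are represented as plain nested lists rather than whatever array type the actual implementation uses.

```python
def make_pair(image, mfcc, intent_id, num_intents, l, m):
    """Build one (I, S, y) tuple for VAR++ training.

    image:       the RGB observation (any container; passed through unchanged)
    mfcc:        l x m MFCC matrix of the sound command, or None if no sound is heard
    intent_id:   intent ID y of the sound command (ignored when mfcc is None)
    num_intents: M, which doubles as the ID of the empty intent
    """
    if mfcc is None:
        # Empty intent: pair the image with an all-zero sound and y = M.
        zero_sound = [[0.0] * m for _ in range(l)]
        return image, zero_sound, num_intents
    return image, mfcc, intent_id


# Hypothetical example with M = 4 intents and a tiny 2 x 3 MFCC matrix.
image = [[0.5, 0.5], [0.5, 0.5]]
empty_pair = make_pair(image, None, None, num_intents=4, l=2, m=3)
lamp_pair = make_pair(image, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], 1,
                      num_intents=4, l=2, m=3)
```

Reserving the ID $M$ for the empty intent keeps all pairs in one uniform format, so the same batch can feed both the contrastive loss and the empty/non-empty classifier.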
We enforce the norm of $z^I$ and $z^S$ to be 1 by applying an L2-normalization, such that the embeddings live on a unit hypersphere, as shown in Fig. 2b. We use the supervised contrastive (SupCon) loss as the objective, which encourages $z^I$ and $z^S$ of the same intent to be closer to each other than to embeddings of a different intent Khosla et al. (2020). Suppose there are $N$ visual-audio pairs in a batch. Let $k \in K := \{1, ..., 2N\}$ be the index of an image or a sound signal within that batch and $P(k) := \{p \in K \setminus \{k\} : y_p = y_k\}$ be the set of indices of all images and sounds of the same intent except for index $k$. Then, the SupCon loss is

$$L_{\mathrm{SupCon}} = -\sum_{k \in K} \frac{1}{|P(k)|} \sum_{p \in P(k)} \log \frac{\exp(z_k \cdot z_p / \tau)}{\sum_{j \in K \setminus \{k\}} \exp(z_k \cdot z_j / \tau)}, \tag{1}$$

where $|\cdot|$ is the cardinality, each $z_k$ can be either an image embedding $z^I$ or a sound embedding $z^S$, and $\tau \in \mathbb{R}^+$ is a scalar temperature parameter. The previous VAR uses visual-audio triplets of the form $(I, S^+, S^-)$ for training, where $I$ and $S^+$ are an image and a sound with the same intent, and $S^-$ is a sound with a different intent. The loss only pulls together the embeddings of $I$ and $S^+$ and pushes apart the embeddings of $I$ and $S^-$ in a triplet. This setting is less efficient because each anchor has only one positive and one negative Chang et al. (2021). In contrast, the SupCon loss allows attraction and repulsion among all images and sounds within a batch, which improves the performance of the representation, as we will show in Sec. 4.3. We additionally introduce a binary classification loss for both the image and the sound to distinguish between empty and non-empty intents. Let $L_{\mathrm{BCE}}$ denote the binary cross-entropy loss and $e$ denote the label of the intent, which is 0 for an empty intent and 1 for a non-empty intent. The batch loss for training the VAR++ is

$$L_{\mathrm{VAR++}} = \alpha_1 L_{\mathrm{SupCon}} + \alpha_2 \frac{1}{N} \sum_{j=1}^{N} \left[ L_{\mathrm{BCE}}(b_I(h^I_j), e_j) + L_{\mathrm{BCE}}(b_S(h^S_j), e_j) \right], \tag{2}$$

where $\alpha_1$ and $\alpha_2$ are the weights of the losses.
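To make the SupCon loss defined above concrete, here is a minimal pure-Python sketch for a small batch. The function names (`l2_normalize`, `dot`, `supcon_loss`), the temperature of 0.1, and the toy embeddings are assumptions chosen for illustration and are not taken from the authors' implementation. Skipping anchors that have no positive in the batch is likewise an assumption.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm (a point on the unit hypersphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def supcon_loss(embeddings, labels, tau=0.1):
    """SupCon loss over a batch of 2N image and sound embeddings.

    embeddings: image and sound projections stacked in one list
    labels:     intent ID of each embedding
    tau:        temperature; 0.1 is an illustrative value, not the paper's
    """
    z = [l2_normalize(e) for e in embeddings]
    indices = range(len(z))
    loss = 0.0
    for k in indices:
        positives = [p for p in indices if p != k and labels[p] == labels[k]]
        if not positives:
            continue  # assumption: anchors without positives are skipped
        # Log of the denominator, which sums over every other sample in the batch.
        log_denom = math.log(sum(math.exp(dot(z[k], z[j]) / tau)
                                 for j in indices if j != k))
        log_probs = [dot(z[k], z[p]) / tau - log_denom for p in positives]
        loss -= sum(log_probs) / len(positives)
    return loss


# Toy batch: two intents, each with one image and one sound embedding.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = supcon_loss(batch, labels=[0, 0, 1, 1])     # positives lie close together
mismatched = supcon_loss(batch, labels=[0, 1, 0, 1])  # positives lie far apart
```

In this toy batch the loss is lower when the labels agree with the geometry of the embeddings, which is the behavior the objective rewards. Unlike a triplet loss, every other sample in the batch contributes to the denominator of each anchor.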
Depending on whether the intent is predicted to be empty or not, the outputs $v^I$ and $v^S$ of VAR++ for the image and the sound are determined by

$$v^I = \mathbb{1}\{\hat{e}^I \ge 0.5\}\, z^I, \quad \hat{e}^I := b_I(h^I), \qquad v^S = \mathbb{1}\{\hat{e}^S \ge 0.5\}\, z^S, \quad \hat{e}^S := b_S(h^S), \tag{3}$$

where $\mathbb{1}$ is an indicator function. We call the latent space of the output the joint space. The purpose of the binary classification is to set the image and sound embeddings of the empty intent to the center of the joint space. This centralization removes the biases caused by the location of the empty intent in the joint space, leading to a better intrinsic reward generated by the VAR++. While the SupCon loss and other self-supervised visual representation learning frameworks were originally applied only to the image modality Chen et al. (2020); Khosla et al. (2020), we extend the framework to a multi-modal setting and create a new representation for command-following robots.
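The gating step described above can be sketched as follows. The helper names `var_output` and `dot` and the example numbers are hypothetical. The sketch only illustrates how an embedding predicted as empty is moved to the origin of the joint space, so that its dot product with any command embedding is zero.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def var_output(z, e_hat, threshold=0.5):
    """Return v = 1{e_hat >= threshold} * z.

    z:     unit-norm projection of an image or a sound
    e_hat: predicted probability that the intent is non-empty
    """
    if e_hat >= threshold:
        return list(z)
    # Predicted empty intent: place the embedding at the center of the joint space.
    return [0.0] * len(z)


z_image = [0.6, 0.8]    # hypothetical unit-norm image embedding
z_command = [0.6, 0.8]  # hypothetical unit-norm command embedding

v_non_empty = var_output(z_image, e_hat=0.9)
v_empty = var_output(z_image, e_hat=0.2)
similarity_when_empty = dot(v_empty, z_command)
```

Because the empty intent maps to the zero vector, it is equally dissimilar to every command, which is one way to read the claim that centralization removes location-dependent bias.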

3.2. RL WITH VISUAL-AUDIO REPRESENTATION

The second stage of our pipeline is to train an RL agent using an intrinsic reward function generated by a trained VAR++. We model a robot command-following task as a Markov Decision Process (MDP), defined by the tuple $\langle \mathcal{X}, \mathcal{A}, P, R, \gamma \rangle$. At each time step $t$, the agent receives an image $I_t$ from its RGB camera and robot states $M_t$ such as the end-effector location or the previous action. At $t = 0$, an additional one-time sound command $S_g$ containing an intent is given to the robot. We freeze the trainable weights of VAR++ in this stage and define the MDP state $x_t \in \mathcal{X}$ as $x_t = [I_t, v^I_t, v^S_g, M_t]$, where $v^I_t$ and $v^S_g$ are the outputs of the VAR++ for $I_t$ and $S_g$, respectively. The VAR++ encodes the information in the input image and the intent in $S_g$. Then, based on its policy $\pi(a_t | x_t)$, the agent takes an action $a_t \in \mathcal{A}$. In return, the agent receives a reward $r_t \in \mathbb{R}$ and transitions to the next state $x_{t+1}$ according to an unknown state transition $P(\cdot | x_t, a_t)$. The process continues until $t$ exceeds the maximum episode length $T$, and the next episode starts.

Intrinsic rewards. Since $v^I$ and $v^S$ of the same intent are pulled together within the VAR++ by the contrastive loss, intrinsic rewards can be derived as the similarity between $v^I$ and $v^S$. Eq. 4 and 5 present two possible task-agnostic and robot-agnostic reward functions:

$$r^{i}_t = v^I_t \cdot v^S_g \tag{4}$$

$$r^{ic}_t = v^I_t \cdot v^S_g + v^S_t \cdot v^S_g \tag{5}$$

where $v^S_t$ is the embedding of the current sound signal $S_t$, which can be triggered in the same way as $S$ as described in Section 3.1. Intuitively, the agent using $r^{i}_t$ receives a high reward when the scene it sees matches the command it hears. The agent trained using the reward $r^{ic}_t$ additionally needs to match the current sound it hears with the sound command to receive high rewards.
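The two intrinsic rewards in Eq. 4 and Eq. 5 reduce to dot products in the joint space. The sketch below uses hypothetical function names (`reward_i`, `reward_ic`) and hand-picked two-dimensional embeddings purely for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reward_i(v_image_t, v_command):
    """Eq. 4: similarity between the current image and the sound command."""
    return dot(v_image_t, v_command)


def reward_ic(v_image_t, v_sound_t, v_command):
    """Eq. 5: Eq. 4 plus the similarity between the current sound and the command."""
    return dot(v_image_t, v_command) + dot(v_sound_t, v_command)


v_command = [1.0, 0.0]      # embedding of the goal command
v_goal_scene = [1.0, 0.0]   # image that matches the command
v_other_scene = [0.0, 1.0]  # image of a different intent
v_empty = [0.0, 0.0]        # empty intent, centered in the joint space

r_match = reward_i(v_goal_scene, v_command)
r_other = reward_i(v_other_scene, v_command)
r_empty = reward_i(v_empty, v_command)
r_match_with_sound = reward_ic(v_goal_scene, v_goal_scene, v_command)
```

With unit-norm embeddings, Eq. 4 is bounded between -1 and 1 and Eq. 5 between -2 and 2, and an empty observation always yields a reward of zero.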
Compared to r ic t , the reward function r i t does not depend on any real-time supervision signal such as current sound v S t from the environment, allowing the agent to perform self-supervised RL training with VAR++. Although RL agents trained with Eq. 4 can already achieve decent performance, providing the current sound S t can further improve the performance Chang et al. (2021) . Since S t can be difficult to obtain especially in real environments, S t is not part of the state x t and thus the robot policy does not require S t at test time. Policy network architecture. We show our policy network architecture used in our experiments in Fig. 2c . Given the state x t , the network outputs the value V (x t ) and the policy π(a t |x t ). Instead of reusing the CNN in f I , we add another CNN to extract the features relevant for achieving the goal. For example, the iTHOR agent needs to encode information about obstacles for collision avoidance. We use an LSTM to encode the embeddings of I t and M t for long-term decision making. Proximal Policy Optimization (PPO) was used for policy and value function learning Schulman et al. (2017) .

3.3. FINE-TUNING

After the robot is deployed in a new domain such as the real world, its performance often degrades due to domain shift in both perception and dynamics Du et al. (2021). Our fine-tuning procedure allows non-experts to continually improve the VAR++ to reduce perception gaps and improve robot skills to reduce dynamics gaps. Since performing accurate state and reward measurements, data labeling, and instrumentation requires domain expertise and is time-consuming, we assume that tuned reward functions, one-hot labels, and accurate speech transcriptions are not available from non-experts. Fortunately, our method requires none of these. To fine-tune VAR++, since we no longer have the underlying labels $y$ for images and sounds, we replace the SupCon loss in Eq. (2) with the following self-supervised contrastive (SSC) loss Chen et al. (2020):

$$L_{\mathrm{SSC}} = -\sum_{k \in K} \log \frac{\exp(z_k \cdot z_{p(k)} / \tau)}{\sum_{j \in K \setminus \{k\}} \exp(z_k \cdot z_j / \tau)}, \tag{6}$$

where $p(k)$ is the index of the data paired with the data of index $k$, which shares the same intent. To fine-tune the robot, non-experts collect visual-audio pairs of the form $(I, S)$ based on their common knowledge using their own voices. The robot can then self-improve its policy network with the intrinsic reward function by randomly sampling a collected sound command as the goal. See Appendix A for a detailed fine-tuning algorithm. To fine-tune VAR in Chang et al. (2021), non-experts have to provide a sound command $S^-$ with a different intent for each image $I$ in order to use the triplet loss. In contrast, VAR++ eliminates this requirement by utilizing the SSC loss, leading to a more intuitive data collection experience for non-experts and better performance with less labeled data.
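The self-supervised contrastive loss used for fine-tuning can be sketched in the same style. The names `ssc_loss`, `l2_normalize`, and `dot`, the temperature value, and the toy data are assumptions for illustration only. The sketch assumes the batch is given as two aligned lists, so that the partner $p(k)$ of image $k$ is the sound at the same position.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ssc_loss(image_embs, sound_embs, tau=0.1):
    """Self-supervised contrastive loss over N collected (image, sound) pairs.

    image_embs[i] and sound_embs[i] belong to the same pair; no intent labels
    are needed. tau = 0.1 is an illustrative value, not the paper's.
    """
    n = len(image_embs)
    z = [l2_normalize(e) for e in image_embs + sound_embs]
    loss = 0.0
    for k in range(2 * n):
        partner = k + n if k < n else k - n  # p(k): the other view of the pair
        log_denom = math.log(sum(math.exp(dot(z[k], z[j]) / tau)
                                 for j in range(2 * n) if j != k))
        loss -= dot(z[k], z[partner]) / tau - log_denom
    return loss


images = [[1.0, 0.0], [0.0, 1.0]]
sounds = [[0.9, 0.1], [0.1, 0.9]]
aligned = ssc_loss(images, sounds)           # each image is near its own sound
misaligned = ssc_loss(images, sounds[::-1])  # pairs deliberately swapped
```

Note that this loss treats every other sample in the batch as a negative, even one that happens to share the anchor's intent. That is the price of dropping the labels $y$, and it is why the user only has to supply matching image and sound pairs and never an explicit negative.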

4. SIMULATION EXPERIMENTS

In this section, we first describe the environments (Fig. 1) and the sound datasets used in the experiments. Then, we compare the performance and data efficiency of our pipeline with several baselines and ablation models. We evaluate the performance of all the methods on three different robotic platforms: TurtleBot, Kuka, and iTHOR. In all environments, after hearing a sound command, the robots must learn exploration skills and approach the corresponding objects. All the robots are equipped with a monocular uncalibrated RGB camera for perception. See Appendix C for detailed descriptions and visualizations.

2014). The Wordset dataset was created from the "0," "1," "2," "3" in GSC. We also used a Mix dataset to show that the VAR++ can map multiple types of sounds to a single object or idea, by mixing speech data with environmental sound. We mix "house" with "jackhammer" and "dog" with "bark". Examples of commands in the iTHOR environment include "turn on the lights" and "pause". The iTHOR environment uses the commands from FSC, while the Kuka and the TurtleBot environments use the commands from the other sound datasets. See Appendix B for more sound examples and the intents we chose for the iTHOR environment.

4.3. EVALUATION OF THE REPRESENTATIONS

Evaluation metrics. The representations are evaluated by a linear layer (LL) and nearest neighbor (NN). For LL, we follow the widely used linear evaluation protocol, where a linear classifier is trained using a cross-entropy loss on top of the frozen encoders, which are $f_I$ and $f_S$ in our case Chen et al. (2020); Kolesnikov et al. (2019); Zhang et al. (2016). For NN, we first find the medoids $C_0, ..., C_M$ of the intents in the joint space using the training data. The predicted label of a test image or sound is $\arg\max_i v \cdot C_i$, where $v$ is the embedding of the test image or sound in the joint space. NN measures the quality of the intrinsic reward produced by the representations.

Baselines. We compare the performance of our VAR++ with VAR, which uses triplet loss for both training and fine-tuning Chang et al. (2021). For each intent, we collect the same number of visual-audio pairs $(I, S, y)$ for VAR++ training and visual-audio triplets of the form $(I, S^+, S^-)$ for VAR training, where $I$ and $S^+$ are an image and a sound with the same intent, $y$ is the intent ID, and $S^-$ is a sound with a different intent. We kept the network architecture of both methods the same.

In Table 1, we observe that both methods achieve high NN accuracy, with VAR++ marginally outperforming VAR, suggesting that both methods are able to produce accurate and reliable rewards for the downstream RL tasks. As for LL, VAR++ is much better than VAR, since even a linear classifier can achieve much higher accuracy with VAR++.

Qualitative results. We visualize the VARs by projecting images and sounds to the joint space, as shown in Fig. 4. We see that the embeddings of the same concept form a cluster and all clusters are separated from each other. Compared to VAR, the clusters in VAR++ have better intra-cluster cohesion and inter-cluster separation, suggesting that distinct concepts are better distinguished and identical concepts are better related.
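The nearest-neighbor (NN) evaluation described above can be sketched as follows. The function names `medoid` and `predict_intent` and the toy clusters are hypothetical. The sketch assumes the medoid of an intent is the training embedding with the largest total dot-product similarity to the other embeddings of that intent, which for unit-norm vectors is equivalent to the smallest total squared Euclidean distance.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def medoid(vectors):
    """Return the member of `vectors` most similar, in total, to all members."""
    return max(vectors, key=lambda c: sum(dot(c, u) for u in vectors))


def predict_intent(v, medoids):
    """Return argmax_i v . C_i over the list of intent medoids."""
    return max(range(len(medoids)), key=lambda i: dot(v, medoids[i]))


# Toy training embeddings (unit norm) for two intents.
intent_0 = [[1.0, 0.0], [0.8, 0.6], [0.96, 0.28]]
intent_1 = [[0.0, 1.0], [0.6, 0.8], [0.28, 0.96]]
medoids = [medoid(intent_0), medoid(intent_1)]

pred_a = predict_intent([1.0, 0.0], medoids)
pred_b = predict_intent([0.0, 1.0], medoids)
```

Because prediction uses the same dot product as the intrinsic reward, NN accuracy is a direct proxy for whether the reward peaks at the correct intent.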
During fine-tuning, although VAR++ does not have S -as an explicit indication of negatives like VAR does in the input, VAR++ can still maintain relatively clear inter-cluster separation and provide reliable rewards for the self-improvement of RL agents. 

4.4. EVALUATION OF THE RL POLICY

Evaluation metrics. We evaluate the model with two metrics: (1) success rate (SR) and (2) the number of labels used for training (LU). We define SR as the percentage of successful test episodes. We test the learned policy for 50 episodes for each intent. For the iTHOR environment, an agent succeeds if it fulfills the command. For the TurtleBot and Kuka environments, a successful episode happens when the agent stays close to the target mentioned in the command for a certain time period. We compare the label usage of the models because a command-following robot deployed in the real world should require as few annotations as possible from non-experts for fine-tuning.

Baselines and ablations. We compare the RL performance of our method against the following baselines and ablation models. The first baseline, denoted as "E2E," is a representative end-to-end deep RL policy for command-following robots Chang et al. (2020). E2E uses hand-tuned task-specific reward functions and requires ground-truth class labels for image and sound classification. The second baseline, denoted as "VAR," trains an RL agent based on the output of the VAR Chang et al. (2021). VAR utilizes triplet loss for training and fine-tuning. Both our method and VAR use Eq. 5 for the downstream RL tasks. We mark a model with "Centered (C)" or "Not centered (NC)" to indicate whether the image and sound embeddings of the empty intent are set to the center of the hypersphere in the joint space. The original VAR method does not centralize the empty intent. The third baseline, denoted as "ASR+NLU+RL (ANR)," is a common modular pipeline. We first use an off-the-shelf automatic speech recognition (ASR) system named Mozilla DeepSpeech Hannun et al. (2014) to transcribe the speech to text. We then train a learning-based natural language understanding (NLU) module to handle the noisy output from the ASR. For example, "Play the music" is sometimes transcribed as "by the music."
Finally, a vision-based RL agent operates with the predicted intent from the NLU. Note that unlike this baseline, our method does not rely on any transcriptions or expertise to be fine-tuned. This baseline does not work with non-speech datasets such as NSynth. See Appendix D for more details.

Control policies with unheard sounds. In this experiment, we test the performance of different models with sound commands never heard by the agent during training (e.g., new speakers). All the models were trained with the same number of RL steps and sufficient labels. For the iTHOR environment, we trained the agents for 9 million (M) RL steps and tested them within the seen floor plans (Floor Plans 201-220). For the TurtleBot and Kuka environments, the total number of RL steps is 3M. No fine-tuning is performed yet. Table 2 and Table 3 show that the application of our method is not limited to a specific robot, robotic task, or type of sound signal. In all environments, compared to the baselines, our method achieves the highest SR. In the iTHOR environment, our method achieves the highest SR and the lowest LU. Although no limit was imposed on LU in this experiment, ASR+NLU+RL and E2E require many more labels during training than VAR and our method. The results also suggest that the intrinsic rewards produced by the representations are sufficient for RL training, since VAR and our method both demonstrate satisfactory performance without receiving any extrinsic rewards. From Table 2 and Table 3, the SR of the ASR+NLU+RL baseline is lower than that of most of the other methods. The main reason is that the system suffers from intermediate and cascading errors among different modules, which coincides with the findings in Chang et al. (2021); Tada et al. (2020). The last four rows of Table 3 indicate the improvement obtained by centralizing the empty intent for both VAR and our method. This result justifies the necessity of the binary classification loss in Eq. 2.
See Appendix E for examples of task execution of the agent and Appendix F for time efficiency measurements.

Fine-tuning in novel iTHOR floor plans. This experiment aims to show the potential of each method to be improved in a new domain. We consider the scenario where a trained household robot is purchased to serve in a new room with an unfamiliar set of furniture and arrangement. Each method is given the same number of new labels, and a data-efficient method should achieve the highest success rate. We first test the performance of the trained models with unheard sound commands in 5 unseen iTHOR floor plans without any fine-tuning. This process uses 0 new labels. The first three columns of Table 4 show the necessity of fine-tuning: the performance of all methods drops as a consequence of domain shift, which is a common problem for learning systems Tobin et al. (2017). We then use 2400 new labels for each unseen floor plan to fine-tune each method for that floor plan. For our method, each intent has 400 new labels on average because there are 6 intents in our iTHOR environment. We followed Sec. 3.3 to fine-tune the VAR and VAR++ and used Eq. 4 to self-improve RL policies without current sounds. For E2E, we collect one-hot labels and use simulator queries during fine-tuning. The fine-tuning is terminated after it reaches the label limit. See Appendix E.4 for a comparison of task execution before and after fine-tuning. From Table 4, we find that the ANR and end-to-end methods can be improved by only 5.2% using 2400 labels, suggesting the inefficiency of fine-tuning E2E after deployment. The label quotas are depleted rapidly due to the inefficient use of labels for policy network fine-tuning, which leads to less RL experience. VAR and our method improve by 49.6% and 65.2%, respectively, using the same number of labels after 1M self-supervised RL training steps.
The richer RL experience was due to the higher data efficiency of our method, because the labels were used to update VAR++ and there was no label consumption during the self-supervised RL exploration. Compared to VAR, our method achieves better performance because VAR++ does not need negative pairs for fine-tuning. This property allows the VAR++ to achieve almost the same RL performance as VAR using only half as many labels, since Chang et al. (2021) reports that the SR for VAR with 5000 new labels is 84.7%. We further show the relation between the number of RL steps and the number of newly collected pairs in two randomly selected unseen iTHOR floor plans (226 and 229). In Table 5, we see that our method is still effective when the number of new pairs and the number of self-supervised RL steps are much smaller than 2400 and 1M, even when no new pairs are collected. More visual-audio pairs and more RL steps allow the agent to improve faster and reach higher success rates.

Fine-tuning in new Kuka environment. This experiment shows that our method can handle dynamics gaps and adapt to unseen objects. We first train the agent in the original Kuka environment with four identical blocks. At test time, we change the link mass, the joint friction, and the parameters of the robot's PID controller. In addition, as shown in Fig. 7 in Appendix C.2, we also replace three of the blocks with a capsule, a teddy bear, and a rabbit. Without fine-tuning, our method achieves 69.5% SR. This result suggests that the VAR++ successfully encodes the most essential spatial information and can generalize to unseen objects with different shapes. We then fine-tune the agent following Sec. 3.3 with 1800 visual-audio pairs and 0.5M RL steps. The final SR rises to 96.5%, which demonstrates the adaptability of our method to novel objects and changes in dynamics.

5. FUTURE WORK AND DISCUSSION

In conclusion, we propose a novel visual-audio representation named VAR++ for command-following robots based on recent advances in (self-)supervised contrastive learning. VAR++ requires far fewer labels from non-experts during fine-tuning yet produces higher-quality rewards for downstream RL agents. Our results suggest that visual-language association and skill development are highly correlated and thus need to be designed together. Furthermore, we are the first to demonstrate that (self-)supervised contrastive loss has the potential to enhance human-robot interaction (HRI) experiences. Such natural human-robot interaction can promote human perception and adoption of robotic systems and marks one step towards practical social robot applications. However, our work has the following limitations, which open up directions for future work. (1) Empty intents may result in a sparse intrinsic reward function, which poses challenges in solving long-horizon tasks. To address this, our reward function can be combined with other intrinsic rewards Burda et al. (2019). (2) We only apply our method to vision-based command-following robots in this paper. It is a promising direction to extend the method to other modalities and provide reward functions for other goal-based multi-modal robot tasks.

A ALGORITHM FOR FINE-TUNING AN AGENT

This section shows the detailed algorithm for fine-tuning the VAR++ and an RL agent.

Algorithm 1 Fine-tuning
5:     Calculate image and sound embeddings: $h_I, z_I, h_S, z_S \leftarrow V(I_i, S_i)$
6:     Calculate $L_{\text{SSC}}$ by Eq. 6
7:     Calculate loss by $L_{\text{finetune}} = \alpha_1 L_{\text{SSC}} + \alpha_2 \frac{1}{N} \sum_{j=1}^{N} \left[ L_{\text{BCE}}(b_I(h_{I_j}), e_j) + L_{\text{BCE}}(b_S(h_{S_j}), e_j) \right]$
8:     Update $V$ to minimize $L_{\text{finetune}}$
9: for k = 0, 1, 2, ... do ▷ Self-supervised RL fine-tuning
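As an illustration of the loss in step 7, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the heads $b_I$ and $b_S$ output raw logits, that $L_{\text{SSC}}$ has already been computed by Eq. 6 and is passed in as a scalar, and that the weights `alpha1` and `alpha2` are placeholders; the function names are hypothetical.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Per-sample binary cross-entropy on raw logits (numerically stable form)."""
    # Equivalent to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))].
    return (np.maximum(logits, 0.0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))

def finetune_loss(l_ssc, img_logits, snd_logits, empty, alpha1=1.0, alpha2=1.0):
    """alpha1 * L_SSC + alpha2 * mean over the minibatch of the two BCE terms.

    l_ssc      : scalar contrastive loss, assumed precomputed (Eq. 6)
    img_logits : assumed outputs of b_I(h_I), shape (N,)
    snd_logits : assumed outputs of b_S(h_S), shape (N,)
    empty      : empty-intent labels e_j in {0, 1}, shape (N,)
    """
    per_sample = (bce_with_logits(img_logits, empty)
                  + bce_with_logits(snd_logits, empty))
    return alpha1 * l_ssc + alpha2 * per_sample.mean()

# Toy minibatch of N = 4 pairs; all values are illustrative only.
empty = np.array([0.0, 1.0, 0.0, 1.0])
loss = finetune_loss(0.5, np.zeros(4), np.zeros(4), empty)
print(loss)
```

With all logits at zero, each BCE term equals ln 2 regardless of the label, so the toy loss reduces to 0.5 + 2 ln 2, which gives a quick sanity check on the averaging.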

| Dataset | Sound | Examples |
|---|---|---|
| FSC | activate light | "Turn on the lights," "Lamp on" |
| FSC | deactivate light | "Switch off the lamp," "Lights off" |
| FSC | activate music | "Put on the music," "Play" |
| FSC | deactivate music | "Pause music," "Stop" |
| FSC | bring shoes | "Get me my shoes," "Bring shoes" |
| GSC | "0," "1," "2," "3" | "zero," "one," "two," "three" |
| GSC | names of 4 objects | "house," "tree," "bird," "dog" |
| NSynth | C4, D4, E4, F4 | Various instruments, tempo, and volume |
| US8K | bark, jackhammer | Sound recorded in the wild |

C ROBOTIC ENVIRONMENT DESCRIPTIONS

The Turtlebot and Kuka environments are developed in PyBullet Coumans & Bai (2016-2019) and mainly pose challenges in fine motor control, with moderate difficulty in perception. In contrast, the iTHOR environment is developed in AI2-THOR Kolve et al. (2017) and is challenging in perception, with discretized and simplified control.

C.2 KUKA

Four identical blocks, each associated with a sound command, are placed in a line at a random location on the table. The robot needs to move its gripper above the block corresponding to a given command based on RGB images. The camera is placed at a fixed location on the side of the table such that it captures the gripper and blocks from a distorted perspective. The relative positions of the gripper tip and the blocks are initialized randomly at the beginning of an episode. The robot needs to develop spatial reasoning skills to approach the target block using the relative positional information observed from the camera.



Figure 1: Illustration of our pipeline. Contrastive learning is used to group images and audio commands of the same intent. The resulting representation VAR++ supports the downstream RL training by encoding the high-dimensional voice and image signals, and providing reward signals and states to the agent.

Figure 2: Network architectures. (a) The VAR++ is a double-branch network optimized with (self-)supervised contrastive loss. (b) The latent space of the VAR++ is a unit hypersphere such that the images and audios of the same intent are closer than those of different intents in the space. (c) The policy network for RL training. The portion in blue is VAR++, which is frozen during the RL training. We use to denote element-wise addition, FC to denote fully connected layers, and [••] to denote concatenation.

encoders $f_I: \mathbb{R}^{n \times n} \to \mathbb{R}^{d_I}$ and $f_S: \mathbb{R}^{l \times m} \to \mathbb{R}^{d_S}$, which map an input image $I$ and a sound signal $S$ to representation vectors $h_I$ and $h_S$, respectively. In practice, any deep models for image and sound processing can be used for $f_I$ and $f_S$. The second component is the projection heads $g_I: \mathbb{R}^{d_I} \to \mathbb{R}^{d}$, $g_S: \mathbb{R}^{d_S} \to \mathbb{R}^{d}$, $b_I: \mathbb{R}^{d_I} \to \mathbb{R}$, and $b_S: \mathbb{R}^{d_S} \to \mathbb{R}$, which map the representations $h_I$ and $h_S$ to the space where the losses are applied. We denote the vector embeddings $g_I(h_I)$ and $g_S(h_S)$ as $z_I$ and $z_S$, respectively. We enforce the norms of $z_I$ and $z_S$ to be 1 by applying an L2-normalization, such that the embeddings live on a unit hypersphere, as shown in Fig. 2b.
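The L2-normalization step described above can be sketched as follows. This is an illustrative NumPy example under stated assumptions rather than the actual VAR++ code: the random arrays stand in for the outputs of $g_I$ and $g_S$, and the function names and the embedding size of 8 are hypothetical. Once both embeddings lie on the unit hypersphere, their dot product is the cosine similarity, which is one natural way to compare an image with a sound command.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def cross_modal_similarity(g_img_out, g_snd_out):
    """Cosine similarity between every image and every sound embedding."""
    z_img = l2_normalize(g_img_out)  # z_I on the unit hypersphere
    z_snd = l2_normalize(g_snd_out)  # z_S on the unit hypersphere
    return z_img @ z_snd.T

# Stand-ins for g_I(h_I) and g_S(h_S): 3 images, 2 sounds, d = 8 (assumed).
rng = np.random.default_rng(0)
g_img_out = rng.normal(size=(3, 8))
g_snd_out = rng.normal(size=(2, 8))
sim = cross_modal_similarity(g_img_out, g_snd_out)
print(sim.shape)
```

Each entry of `sim` lies in [-1, 1]; pairs sharing an intent would be expected to score higher than pairs with different intents after training.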

Figure 3: Simulation environments for the experiments.

SOUND DATA We use several types of sounds from state-of-the-art datasets in training and testing. Specifically, we use speech signals collected for training the end-to-end SLU from Fluent Speech Commands (FSC) Lugosch et al. (2019) and short speech commands from Google Speech Commands (GSC) Warden (2018). We also collect single-tone signals from NSynth Engel et al. (2017) and urban & environmental sounds from UrbanSound8K (US8K) Salamon et al. (

Figure 4: Visualizations of the VARs in the iTHOR environments with FSC. The colors indicate the ground truth intent ID of embeddings of sound (marked by triangles) and image (marked by circles). (a) VAR after the training. (b) VAR after the fine-tuning. (c) VAR++ after the training. (d) VAR++ after the fine-tuning.

1: Inputs: A trained VAR++ $V$ and a trained policy $\pi_\theta$
2: Collect a small set of visual-audio pairs $D = \{(I_i, S_i)\}_{i=1}^{U}$
3: for a sampled minibatch $\{(I_i, S_i)\}_{i=1}^{N}$ from $D$ do ▷ Fine-tune VAR++
4:     Calculate empty intent label $e_i$ by checking if $S_i = \mathbf{0}_{l \times m}$

C.1 TURTLEBOT

Four objects (a cube, a sphere, a cone, and a cylinder) are placed in a 4 m² space. Each object has an associated intent. The goal of the robot is to navigate to the object corresponding to the given sound command based on RGB images. The robot and the four objects are placed randomly in the arena at the beginning of an episode. The robot's action consists of the change in desired translational velocity and the change in desired orientation. The robot needs to develop exploration skills to discover the goal object in the shortest time.

Figure 5: Visualization of the TurtleBot environment with paired images and voices from the Wordset. In this case, "zero" means cube, "one" means sphere, "two" means cone, and "three" means cylinder. The red and green rays are for illustration purposes only. The rightmost figure shows the camera view.

Figure 6: Visualization of the Kuka environment with paired images and voices from the Wordset. In this case, "zero" means the leftmost block, "one" means the second block from the left, and so on. The red and green rays are for illustration purposes only. The rightmost figure shows the camera view.

Figure 7: Visualization of the new objects for the Kuka fine-tuning experiment.

Figure 8: Visualization of the iTHOR environment with paired images and voices from the FSC.

Percentage accuracy of VARs with nearest neighbor (NN) and linear layer (LL).

Test success rate results in Kuka and TurtleBot environments with different sounds.

$S^+$, $S^-$), requires 2 labels to indicate the positive and the negative. Every E2E training step requires 3 labels, including the target object state checking (e.g., checking whether the light is switched on), distance measuring to calculate the extrinsic reward, and a one-hot label for auxiliary losses.

Train label-usage and test success rate results in iTHOR 201-220 with FSC dataset.

Average success rates over unseen iTHOR Floor Plans 226-230 after fine-tuning with additional label-usage.

Model performance and the number of visual-audio pairs collected for the fine-tuning

Sound signals used in the experiments.

E.4 ITHOR FINE-TUNING

Figure 12: Visualization of the task execution in the iTHOR environment before and after the fine-tuning in unseen floor plans with sound commands given by new speakers. The sounds come from the FSC dataset.

F TIME EFFICIENCY

In this section, we evaluate the time efficiency of all the methods. All the models run on a single Nvidia GTX 1080 Ti GPU and an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz. We report the average time in seconds (s) for the model to take one action in the iTHOR environment with the FSC dataset. The average is calculated from 12500 samples.

• ANR: 0.041s
• E2E: 0.018s
• VAR: 0.024s
• VAR++: 0.022s
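For reference, a per-action average of this kind could be measured with a simple wall-clock loop such as the sketch below. This is an assumed measurement procedure, not a description of how the numbers above were obtained: `mean_action_time` and the dummy `step_fn` are hypothetical, and a real GPU model would also need device synchronization before reading the clock. The sketch also converts the reported times into actions per second.

```python
import time

def mean_action_time(step_fn, n_samples):
    """Average wall-clock seconds per call of step_fn over n_samples calls."""
    start = time.perf_counter()
    for _ in range(n_samples):
        step_fn()  # hypothetical: one forward pass that selects an action
    return (time.perf_counter() - start) / n_samples

# Per-action times reported in this section, in seconds.
reported = {"ANR": 0.041, "E2E": 0.018, "VAR": 0.024, "VAR++": 0.022}

# Equivalent throughput in actions per second.
rates = {name: 1.0 / seconds for name, seconds in reported.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} actions/s")

# Dummy step function, used only to show the call pattern.
print(mean_action_time(lambda: None, 1000))
```

Under these numbers, every method runs above 20 actions per second, and VAR++ takes about 0.002 s less per action than VAR.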

