TOWARD EFFECTIVE DEEP REINFORCEMENT LEARNING FOR 3D ROBOTIC MANIPULATION: MULTIMODAL END-TO-END REINFORCEMENT LEARNING FROM VISUAL AND PROPRIOCEPTIVE FEEDBACK

Abstract

Sample-efficient reinforcement learning (RL) methods capable of learning directly from raw sensory data, without human-crafted representations, would open up real-world applications in robotics and control. Recent advances in visual RL have shown that learning a latent representation jointly with existing RL algorithms narrows the gap between state-based and image-based training. However, image-based training remains significantly less sample-efficient than state-based training on 3D continuous control problems such as robotic manipulation. In this study, we propose an effective model-free off-policy RL method for 3D robotic manipulation that can be trained end to end on multimodal raw sensory data obtained from a vision camera and a robot's joint encoders. Most notably, our method learns a latent multimodal representation and a policy jointly, efficiently, and end to end from multimodal raw sensory data, without human-crafted representations or prior expert demonstrations. Our method, which we dub MERL: Multimodal End-to-end Reinforcement Learning, is a simple but effective approach that significantly outperforms both current state-of-the-art visual RL and state-based RL methods in sample efficiency, learning performance, and training stability on 3D robotic manipulation tasks from DeepMind Control.

1. INTRODUCTION

Deep reinforcement learning (deep RL), the effective combination of RL and deep learning, has allowed RL methods to attain remarkable results across a wide range of domains, including board and video games with discrete action spaces (Mnih et al., 2015; Silver et al., 2017; Vinyals et al., 2019) and robotics and control with continuous action spaces (Levine et al., 2016; Zhu et al., 2020; Kalashnikov et al., 2021; Ibarz et al., 2021; Kroemer et al., 2021). Many deep RL studies rely on human-crafted representations because state-based training, which operates on a coordinate state, is known to be significantly more sample-efficient than training on raw sensory data such as images. However, human-crafted representations pose several major limitations for 3D robotic manipulation: (a) they cannot perfectly represent the robot environment; (b) in real-world applications, a separate perception module is required to estimate the environment state; and (c) a network trained on a human-crafted state representation cannot be reused across tasks, even for small variations in the environment such as changes in the number, size, or shape of objects.

Over the last three years, the RL community has made significant headway on these limitations by greatly improving the sample efficiency of image-based training (Hafner et al., 2019a; b; 2020; 2022; Lee et al., 2020a; Srinivas et al., 2020; Yarats et al., 2020; 2021a; b; c). A key insight of these studies is that better low-dimensional representations can be learned from image pixels through an autoencoder (Yarats et al., 2021c), variational inference (Hafner et al., 2019a; b; 2020; 2022; Lee et al., 2020a), contrastive learning (Srinivas et al., 2020; Yarats et al., 2021b), or data augmentation (Yarats et al., 2020; 2021a), which in turn significantly improves sample efficiency. Recently, some visual RL studies have solved 3D continuous control problems such as the quadruped and humanoid locomotion tasks from the DeepMind Control (DMC) suite, further bridging the gap between state-based and image-based training (Hafner et al., 2020; Yarats et al., 2021a). Despite this progress, image-based training remains notably less sample-efficient than state-based training on 3D continuous control problems such as robotic manipulation.

In this paper, we propose an effective deep RL method for 3D robotic manipulation, which we dub MERL: Multimodal End-to-end Reinforcement Learning. MERL is a simple but effective model-free off-policy RL method that can be trained end to end on two types of raw sensory data, RGB images and proprioception, whose modalities differ in dimensionality and value range. Most notably, MERL learns a latent multimodal representation and a policy jointly, efficiently, and end to end from multimodal raw sensory data, without human-crafted representations or prior expert demonstrations.
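To make the idea of joint multimodal encoding concrete, the following PyTorch snippet is a minimal sketch, not MERL's actual network: the layer sizes, the 84x84 input resolution, and the fusion-by-concatenation design are all assumptions made for illustration.

```python
# Minimal sketch of a multimodal encoder that fuses RGB pixels with
# proprioceptive joint readings into a single latent vector. All layer
# sizes, names, and the fusion-by-concatenation choice are illustrative
# assumptions, not MERL's actual architecture.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, proprio_dim: int, latent_dim: int = 50):
        super().__init__()
        # Convolutional trunk for 84x84 RGB frames (values scaled to [0, 1]).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened CNN output size
            cnn_out = self.cnn(torch.zeros(1, 3, 84, 84)).shape[1]
        # Small MLP for low-dimensional proprioception (joint angles etc.),
        # whose dimensionality and value range differ from pixels.
        self.proprio_mlp = nn.Sequential(
            nn.Linear(proprio_dim, 64), nn.ReLU(),
        )
        # Fuse by concatenation, then project to a compact latent.
        self.fuse = nn.Sequential(
            nn.Linear(cnn_out + 64, latent_dim),
            nn.LayerNorm(latent_dim), nn.Tanh(),
        )

    def forward(self, pixels: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.cnn(pixels), self.proprio_mlp(proprio)], dim=-1)
        return self.fuse(z)

# The latent would feed actor and critic heads; gradients flowing back
# through the encoder are what make representation and policy learning
# joint and end to end.
encoder = MultimodalEncoder(proprio_dim=12)
latent = encoder(torch.rand(8, 3, 84, 84), torch.randn(8, 12))
print(latent.shape)  # torch.Size([8, 50])
```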
Compared to current state-of-the-art visual RL and state-based RL methods, MERL provides significant improvements in sample efficiency, learning performance, and training stability on five 3D robotic manipulation tasks from DMC (jaco-reach-duplo, jaco-move-box, jaco-lift-box, jaco-push-box-with-obstacle, and jaco-pick-and-stack) (Tunyasuvunakool et al., 2020). In addition, MERL solves all three complex 3D humanoid locomotion tasks from DMC (humanoid-stand, humanoid-walk, and humanoid-run) within 5M environment steps, whereas the state-of-the-art visual RL method, DrQ-v2 (Yarats et al., 2021a), solves the same tasks within 30M environment steps. To the best of our knowledge, MERL is the first model-free off-policy method not only to learn a latent multimodal representation and a policy jointly, efficiently, and end to end from multimodal raw sensory data, but also to set a new state of the art by significantly outperforming both current state-of-the-art visual RL and state-based RL methods in sample efficiency, learning performance, and training stability.

The main contributions of this paper can be summarized as follows: (1) an end-to-end approach that learns directly from multimodal raw sensory data for the efficient learning of a policy for 3D robotic manipulation; (2) a demonstration that the approach significantly outperforms current state-of-the-art visual RL and state-based RL methods in sample efficiency, learning performance, and training stability on five 3D robotic manipulation tasks from DMC; (3) a further demonstration of the approach's effectiveness by solving complex 3D humanoid locomotion tasks from DMC within 5M environment steps; and (4) a deep RL method that learns a latent multimodal representation and a policy jointly, efficiently, and end to end from multimodal raw sensory data, without human-crafted representations or prior expert demonstrations.
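For concreteness, the sketch below shows how a DMC manipulation environment exposes the multimodal observations such a method consumes. The paper's task names (for example, jaco-reach-duplo) appear to be custom variants; the snippet instead loads reach_duplo_vision from the stock dm_control manipulation suite, which is an assumption about the closest standard task.

```python
# Sketch: inspecting the multimodal observations of a Jaco manipulation
# task from dm_control. 'reach_duplo_vision' is a stock task name and is
# assumed to approximate the paper's custom jaco-reach-duplo task.
import numpy as np
from dm_control import manipulation

env = manipulation.load('reach_duplo_vision', seed=0)

# The observation is a dict mixing camera pixels with proprioceptive
# entries such as joint positions -- the two modalities discussed above.
timestep = env.reset()
for name, value in timestep.observation.items():
    print(name, np.asarray(value).shape)

# Short random-action rollout.
spec = env.action_spec()
for _ in range(5):
    action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
    timestep = env.step(action)
    print(timestep.reward)
```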

2. RELATED WORK

2.1. REINFORCEMENT LEARNING FOR 3D ROBOTIC MANIPULATION

Deep RL has seen widespread success across a variety of domains, including board and video games and robotics and control (Mnih et al., 2015; Silver et al., 2017; Vinyals et al., 2019; Levine et al., 2016; Zhu et al., 2020; Kalashnikov et al., 2021; Ibarz et al., 2021; Kroemer et al., 2021). In recent years, a number of deep RL methods have been successfully applied to 3D robotic manipulation tasks ranging from 'easy' (for example, reach-target) to 'hard' (for example, assembly). Some of these methods require hand-engineered components for perception, state estimation, and low-level control because they learn from human-crafted representations (for example, Yamada et al., 2020; Lee et al., 2019b; 2021b; Nam et al., 2022); such methods are commonly referred to as state-based RL. Over the last three years, visual RL, which learns directly from image pixels, has advanced greatly and has achieved significant improvements in sample efficiency on 3D continuous control problems, helping to bridge the gap between state-based and image-based training (Hafner et al., 2019a; b; 2020; 2022; Srinivas et al., 2020; Yarats et al., 2020; 2021a; b; c; Wu et al., 2022). A key insight of these studies is that better low-dimensional representations can be learned from image pixels through techniques such as an autoencoder, variational inference, contrastive learning, or data augmentation. Currently, DrQ-v2 (Yarats et al., 2021a) shows state-of-the-art performance among visual RL methods.
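As a concrete instance of one such technique, the random-shift image augmentation popularized by DrQ and DrQ-v2 can be sketched in a few lines. The 4-pixel pad follows the value reported for DrQ; the per-sample loop and other details below are simplifications, not the reference implementation.

```python
# Sketch of DrQ-style random-shift augmentation: replicate-pad the frame
# by a few pixels, then crop back to the original size at a random offset.
import torch
import torch.nn.functional as F

def random_shift(images: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """images: (B, C, H, W) float tensor; returns a randomly shifted batch."""
    b, _, h, w = images.shape
    padded = F.pad(images, (pad, pad, pad, pad), mode='replicate')
    out = torch.empty_like(images)
    for i in range(b):  # independent shift per sample
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

# Observations are typically augmented this way before encoding in the
# critic update, which regularizes the learned value function.
batch = torch.rand(8, 3, 84, 84)
print(random_shift(batch).shape)  # torch.Size([8, 3, 84, 84])
```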

