UNSUPERVISED MODEL-BASED PRE-TRAINING FOR DATA-EFFICIENT CONTROL FROM PIXELS

Abstract

Controlling artificial agents from visual sensory data is an arduous task. Reinforcement learning (RL) algorithms can succeed at this, but they require large amounts of interaction between the agent and the environment. To alleviate the issue, unsupervised RL proposes to employ self-supervised interaction and learning, in order to adapt faster to future tasks. Yet, whether current unsupervised strategies improve generalization capabilities is still unclear, especially in visual control settings. In this work, we design an unsupervised RL strategy for data-efficient visual control. First, we show that world models pre-trained with data collected using unsupervised RL can facilitate adaptation to future tasks. Then, we analyze several design choices to adapt faster, effectively reusing the agents' pre-trained components and planning in imagination with our hybrid planner, which we dub Dyna-MPC. By combining the findings of a large-scale empirical study, we establish an approach that strongly improves performance on the Unsupervised RL Benchmark, requiring 20× less data to match the performance of supervised methods. The approach also demonstrates robust performance on the Real-World RL benchmark, hinting that the approach generalizes to noisy environments.

1. INTRODUCTION

Modern successes of deep reinforcement learning (RL) have shown promising results for control problems (Levine et al., 2016; OpenAI et al., 2019; Lu et al., 2021). However, training an agent for each task individually requires a large amount of task-specific environment interactions, incurring huge redundancy and prolonged human supervision. Developing algorithms that can efficiently adapt and generalize to new tasks has hence become an active area of research in the RL community. In computer vision and natural language processing, unsupervised learning has enabled training models without supervision to reduce sample complexity on downstream tasks (Chen et al., 2020; Radford et al., 2019). In a similar fashion, unsupervised RL (URL) agents aim to learn about the environment without the need for external reward functions, driven by intrinsic motivation (Pathak et al., 2017; Burda et al., 2019a; Bellemare et al., 2016). Any learned models can then be adapted to downstream tasks, aiming to reduce the required amount of interactions with the environment. Recently, the Unsupervised RL Benchmark (URLB) (Laskin et al., 2021) established a common protocol to compare self-supervised algorithms across several domains and tasks from the DMC Suite (Tassa et al., 2018). In the benchmark, an agent is allowed a task-agnostic pre-training stage, where it can interact with the environment in an unsupervised manner, followed by a fine-tuning stage where, given a limited budget of interactions with the environment, the agent should quickly adapt to a specific task. However, the results obtained by Laskin et al. (2021) suggest that current URL approaches may be insufficient to perform well on the benchmark, especially when the inputs of the agent are pixel-based images.
World models have proven highly effective for solving RL tasks from vision, both in simulation (Hafner et al., 2021; 2019a) and in robotics (Wu et al., 2022), and they are generally data-efficient as they enable learning behavior in imagination (Sutton, 1991). Inspired by previous work on exploration (Sekar et al., 2020), we hypothesize this feature could be key in the unsupervised RL setting, as a pre-trained world model can leverage previous experience to learn behavior for new tasks in imagination, and in our work, we study how to best exploit this feature. We adopt the URLB setup to perform a large-scale study, involving several unsupervised RL methods for pre-training model-based agents, different fine-tuning strategies, and a new improved algorithm for efficiently planning with world models. The resulting approach, which combines the findings of our study, strongly improves performance on the URL benchmark from pixels, nearly achieving the asymptotic performance of supervised RL agents trained with 20× more task-specific data, and bridging the gap with low-dimensional state inputs (Laskin et al., 2021). Contributions. This work does not propose a novel complex method. Rather, we study the interplay of various existing components and propose a novel final solution that outperforms the existing state of the art on URLB by a staggering margin.
Specifically:
• we demonstrate that unsupervised RL combined with world models can be an effective pre-training strategy to enable data-efficient visual control (Section 3.1),
• we study the interplay between the agent's pre-trained components that improves sample efficiency during fine-tuning (Section 3.2),
• we propose a novel hybrid planner we call Dyna-MPC, which allows us to effectively combine behaviors learned in imagination with planning (Section 3.3),
• combining our findings into one approach, we outperform previous approaches on URLB from pixels, nearly solving the benchmark (Section 4.1),
• we show the approach is resilient to environment perturbations, evaluating it on the Real-World RL benchmark (Dulac-Arnold et al., 2020) (Section 4.2),
• we present an extensive analysis of the pre-trained agents, aimed at understanding in depth the current findings and limitations (Section 4.3).
An extensive empirical evaluation, supported by more than 2k experiments spanning main results, analyses, and ablations, was used to carefully design our method. We hope that our large-scale evaluation will inform future research towards developing and deploying pre-trained agents that can be adapted with considerably less data to more complex/realistic tasks, as has happened with unsupervised pre-trained models for vision (Parisi et al., 2022) and language (Ahn et al., 2022).
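Dyna-MPC itself is detailed in Section 3.3. As a rough illustration of what a hybrid planner combining a learned actor with model-predictive control might look like, consider the following CEM-style sketch. All names here (`rollout`, `actor_sample`, `dyna_mpc_plan`) are hypothetical placeholders for this illustration, not the paper's actual implementation:

```python
import numpy as np

def dyna_mpc_plan(rollout, actor_sample, state, horizon=5, n_candidates=64,
                  n_actor=16, n_elites=8, iters=3, action_dim=2, seed=0):
    """Hybrid planning sketch: mix randomly sampled action sequences with
    proposals from a learned actor, score them with world-model rollouts
    (rollout(state, actions) -> imagined return), and iteratively refit
    the sampling distribution to the best sequences (CEM-style)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences around the current plan.
        candidates = mean + std * rng.standard_normal(
            (n_candidates, horizon, action_dim))
        # Hybrid step: some candidates come from the learned actor instead.
        for i in range(n_actor):
            candidates[i] = actor_sample(state, horizon)
        # Evaluate each sequence in the (learned) world model.
        returns = np.array([rollout(state, seq) for seq in candidates])
        elites = candidates[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan
```

In a model-based agent, `rollout` would unroll the world model from the current latent state and sum predicted rewards (optionally bootstrapping with a learned value at the horizon), while `actor_sample` would draw sequences from the pre-trained policy.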

2. PRELIMINARIES

Reinforcement learning. The RL setting can be formalized as a Markov Decision Process (MDP), denoted with the tuple {S, A, T, R, γ}, where S is the set of states, A is the set of actions, T is the state transition dynamics, R is the reward function, and γ is a discount factor. The objective of an RL agent is to maximize the expected discounted sum of rewards over time for a given task, also called return, and indicated as G_t = Σ_{k=t+1}^{T} γ^{k−t−1} r_k. In continuous-action settings, one can learn an actor, i.e. a model predicting the action to take from a certain state, and a critic, i.e. a model that estimates the expected value of the actor's actions over time. Actor-critic algorithms can be combined with the expressiveness of neural network models to solve complex continuous control tasks (Haarnoja et al., 2018; Lillicrap et al., 2016; Schulman et al., 2017).
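As a concrete check of the return definition above, the discounted sum can be computed in a single backward pass using the recursion G_t = r_{t+1} + γ · G_{t+1}, with G_T = 0. A minimal sketch (the episode is assumed finite and rewards are given as a plain list; the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = r_1 + gamma * r_2 + ... + gamma^(T-1) * r_T,
    where rewards = [r_1, ..., r_T], via the backward recursion
    G_t = r_{t+1} + gamma * G_{t+1}, starting from G_T = 0."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three rewards of 1.0 with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.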



The PyTorch code for the experiments will be open-sourced upon publication.



Figure 1: Method overview. Our method considers a pre-training (PT) and a fine-tuning (FT) stage. During pre-training, the agent interacts with the environment through unsupervised RL, maximizing an intrinsic reward function, and concurrently training a world model on the data collected. During fine-tuning, the agent exploits its pre-trained components and plans in imagination, to efficiently adapt to different downstream tasks, maximizing the rewards received from the environment.
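The two stages in Figure 1 can be summarized with a short, hedged sketch. All classes and method names below (`env`, `agent`, `world_model`, `intrinsic_reward`, `plan`) are hypothetical placeholders chosen for this illustration, not the actual implementation:

```python
def pretrain(env, world_model, agent, steps):
    """Task-agnostic stage: the task reward is ignored; the agent explores
    driven by an intrinsic reward while the world model is trained."""
    obs = env.reset()
    for _ in range(steps):
        action = agent.act(obs)                      # exploratory policy
        next_obs, _, done, _ = env.step(action)      # task reward discarded
        r_int = agent.intrinsic_reward(obs, action, next_obs)
        world_model.update(obs, action, r_int, next_obs)
        agent.update(world_model)                    # learn in imagination
        obs = env.reset() if done else next_obs

def finetune(env, world_model, agent, budget):
    """Task-specific stage: within a limited interaction budget, the agent
    reuses the pre-trained model and plans to maximize the task reward."""
    obs = env.reset()
    for _ in range(budget):
        action = agent.plan(world_model, obs)        # plan in imagination
        next_obs, reward, done, _ = env.step(action)
        world_model.update(obs, action, reward, next_obs)  # task reward now used
        agent.update(world_model)
        obs = env.reset() if done else next_obs
```

The key point the sketch captures is that only the reward signal and the action-selection mechanism change between the two stages; the world model and the agent's components carry over from pre-training to fine-tuning.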

