MODEL-FREE REINFORCEMENT LEARNING THAT TRANSFERS USING RANDOM FEATURES

Abstract

Reinforcement learning (RL) algorithms have the potential not only to synthesize complex control behaviors, but also to transfer across tasks. Typical model-free RL algorithms are good at solving individual problems with high-dimensional state spaces or long horizons, but can struggle to transfer across tasks with different reward functions. Model-based RL algorithms, on the other hand, naturally enable transfer across different reward functions, but struggle to scale to settings with long horizons and/or high-dimensional observations. In this work, we propose a new way to transfer behaviors across tasks with different reward functions, combining the scalability of model-free RL with the transferability of model-based RL. In particular, we show how model-free RL with randomly sampled features as rewards can implicitly model long-horizon environment dynamics. Model-predictive control using these implicit models enables quick adaptation to problems with new reward functions, while scaling to problems with high-dimensional observations and long horizons. Our method can be trained on offline datasets without reward labels and quickly deployed on new tasks, making it more widely applicable than typical model-free and model-based RL methods. We validate that our proposed algorithm enables transfer across tasks in a variety of robotic and analytic domains.
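The intuition behind reusing random reward features can be seen in a minimal sketch (our own toy construction, not the paper's algorithm): for a fixed policy, the value function is linear in the reward, so value functions computed once for randomly sampled reward features form a basis that transfers to any new reward expressible in those features, with no further environment interaction. All names and sizes below are illustrative.

```python
import numpy as np

# Toy illustration (not the paper's method): policy evaluation is linear
# in the reward. Evaluating a fixed policy under K random reward features
# phi_k yields value functions V_k; a new reward r = sum_k w_k phi_k then
# has value V_r = sum_k w_k V_k exactly.

rng = np.random.default_rng(1)
S, K, gamma = 6, 8, 0.9
P = rng.dirichlet(np.ones(S), size=S)    # fixed-policy transition matrix (rows sum to 1)
Phi = rng.normal(size=(S, K))            # K random reward features phi_k(s)

# Policy evaluation: V = (I - gamma P)^{-1} r, applied column-wise.
solve = np.linalg.inv(np.eye(S) - gamma * P)
V_basis = solve @ Phi                    # one value function per random feature

w = rng.normal(size=K)                   # new task: reward r = Phi @ w
r_new = Phi @ w
V_direct = solve @ r_new                 # evaluate the new reward from scratch
V_transfer = V_basis @ w                 # or reuse the precomputed basis

print(np.allclose(V_direct, V_transfer))  # → True
```

The paper's setting is more general (optimal control via MPC rather than fixed-policy evaluation), but the same linearity is what makes randomly sampled features a reusable substrate for new rewards.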

1. INTRODUCTION

Reinforcement learning (RL) algorithms have been shown to successfully synthesize complex behavior in single-task sequential decision-making problems [1, 2, 3], but more importantly have the potential for broad generalization across problems. However, many RL algorithms are deployed as specialists: they solve single tasks and are not prepared to reuse their interactions. In this work, we focus specifically on transferring information across problems that share the same environment dynamics but differ in reward function. This setting reflects a number of real-world scenarios, such as robotics. For instance, in tabletop robotic manipulation, tasks like pulling an object, pushing an object, picking it up, and pushing it to different locations all share the same transition dynamics but involve different reward functions. We hence ask: can we reuse information across these tasks in a way that scales to high-dimensional, long-horizon problems?

When considering how to tackle this problem, a natural possibility is direct policy search [4, 5]. Typical policy search algorithms can achieve good performance on a single task, but they entangle dynamics and reward: the policy one searches for is optimal for a particular reward but may be highly suboptimal in new scenarios. Other model-free RL algorithms like actor-critic methods [6, 7, 8] or Q-learning [9, 1] may exacerbate this issue, with learned Q-functions entangling dynamics, rewards, and policies. For new scenarios, an ideal algorithm should disentangle and retain the elements of shared dynamics while allowing new rewards to be easily substituted in. Model-based RL algorithms [10, 11, 12, 13, 14] are a natural fit for disentangling dynamics and rewards.
These algorithms usually learn a single-step model of transition dynamics and leverage this learned model to perform planning [15, 12, 11, 16]. Such models are naturally modular and can be used to re-plan behaviors for new rewards. However, one-step dynamics models are brittle and suffer from compounding errors [17, 18].
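The compounding-error problem can be seen in a minimal numerical sketch (our own illustration with arbitrary linear dynamics, not drawn from the paper): a learned one-step model with a small per-step error diverges from the true trajectory as the rollout horizon grows.

```python
import numpy as np

# Illustration: true dynamics s' = A s; a learned model A_hat = A + E
# carries a small one-step error E. Rolling A_hat out for H steps
# accumulates that error, so long-horizon predictions degrade even
# though the one-step fit is accurate.

rng = np.random.default_rng(0)
dim = 4
A = 0.99 * np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # stable true dynamics
E = 0.01 * rng.normal(size=(dim, dim))                   # small model error
A_hat = A + E

s0 = rng.normal(size=dim)
s_true, s_pred = s0.copy(), s0.copy()
errors = []
for h in range(1, 51):
    s_true = A @ s_true        # true trajectory
    s_pred = A_hat @ s_pred    # open-loop model rollout
    errors.append(np.linalg.norm(s_pred - s_true))

print(f"1-step prediction error:  {errors[0]:.4f}")
print(f"50-step prediction error: {errors[-1]:.4f}")
```

The 50-step error is far larger than the 1-step error; this growth with horizon is the brittleness that planning with one-step models must contend with.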

