MODEL-FREE REINFORCEMENT LEARNING THAT TRANSFERS USING RANDOM FEATURES

Abstract

Reinforcement learning (RL) algorithms have the potential not only to synthesize complex control behaviors, but also to transfer across tasks. Typical model-free RL algorithms are good at solving individual problems with high-dimensional state spaces or long horizons, but can struggle to transfer across tasks with different reward functions. Model-based RL algorithms, on the other hand, naturally enable transfer across different reward functions, but struggle to scale to settings with long horizons and/or high-dimensional observations. In this work, we propose a new way to transfer behaviors across tasks with different reward functions, combining the scalability of model-free RL with the transferability of model-based RL. In particular, we show that model-free RL using randomly sampled features as rewards can implicitly model long-horizon environment dynamics. Model-predictive control using these implicit models enables quick adaptation to problems with new reward functions, while scaling to problems with high-dimensional observations and long horizons. Our method can be trained on offline datasets without reward labels and quickly deployed on new tasks, making it more widely applicable than typical model-free and model-based RL methods. We validate that our proposed algorithm enables transfer across tasks in a variety of robotics and analytic domains.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have been shown to successfully synthesize complex behavior in single-task sequential decision-making problems [1, 2, 3], but more importantly have the potential for broad generalization across problems. However, many RL algorithms are deployed as specialists: they solve single tasks and are not prepared to reuse their interactions. In this work, we focus specifically on the problem of transferring information across problems where the environment dynamics are shared but the reward function changes. This setting reflects a number of scenarios encountered in real-world domains such as robotics. For instance, in tabletop robotic manipulation, different tasks like pulling an object, pushing an object, picking it up, and pushing it to different locations all share the same transition dynamics but involve different reward functions. We hence ask: can we reuse information across these tasks in a way that scales to high-dimensional, long-horizon problems?

When considering how to tackle this problem, a natural possibility is direct policy search [4, 5]. Typical policy search algorithms can achieve good performance on a single task, but entangle the dynamics and reward, in the sense that the policy one finds is optimal for a particular reward but may be highly suboptimal in new scenarios. Other model-free RL algorithms, such as actor-critic methods [6, 7, 8] or Q-learning [9, 1], may exacerbate this issue, with learned Q-functions entangling dynamics, rewards, and policies. In new scenarios, an ideal algorithm should be able to disentangle and retain the elements of shared dynamics, while easily substituting in new rewards. A natural fit for disentangling dynamics and rewards are model-based RL algorithms [10, 11, 12, 13, 14].
These algorithms usually learn a single-step model of transition dynamics and leverage this learned model to perform planning [15, 12, 11, 16]. Such models are naturally modular and can be used to re-plan behaviors for new rewards. However, one-step dynamics models are brittle and suffer from compounding errors [17, 18].

In this work, we ask: can we build reinforcement learning algorithms that disentangle dynamics, rewards, and policies for transfer across problems, while retaining the ability to solve problems with high-dimensional observations and long horizons? In particular, we propose an algorithm that trains on large offline datasets of environment transitions to implicitly model transition dynamics, and then quickly performs decision making on a variety of new tasks with different reward functions encountered at test time. Specifically, we propose to model the long-term behavior of randomly chosen basis functions (often called cumulants) of the environment state and action, under open-loop control, using what we term Q-basis functions. These Q-basis functions can easily be recombined to infer the true Q-function for tasks with arbitrary rewards by simply solving a linear regression problem. Intuitively, rather than predicting the evolution of the entire state step by step, predicting the accumulated long-term future of many random features of the state carries information equivalent to a dynamics model, thereby forming an "implicit model" that can transfer. These implicit models scale better with horizon and environment dimensionality than typical one-step dynamics models, while retaining their transferability and modularity.
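To make the idea concrete, the following sketch shows random features of a state-action pair serving as cumulants, and the Monte Carlo regression target for a Q-basis function: the discounted sum of those features along an open-loop rollout. All names, dimensions, and the cosine nonlinearity here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM, NUM_FEATURES = 4, 2, 64
GAMMA = 0.99

# Randomly sampled basis functions ("cumulants"): a fixed random
# projection of (state, action) followed by a nonlinearity.
W = rng.normal(size=(STATE_DIM + ACTION_DIM, NUM_FEATURES))
b = rng.uniform(0.0, 2.0 * np.pi, size=NUM_FEATURES)

def random_features(state, action):
    """phi(s, a): random Fourier-style features of a state-action pair."""
    x = np.concatenate([state, action])
    return np.cos(x @ W + b)

def discounted_feature_sum(states, actions):
    """Regression target for a Q-basis function: the discounted
    accumulation of random features along an open-loop trajectory."""
    total = np.zeros(NUM_FEATURES)
    for t, (s, a) in enumerate(zip(states, actions)):
        total += (GAMMA ** t) * random_features(s, a)
    return total
```

In this view, a Q-basis network would be trained (by ordinary supervised regression on the offline data) to predict `discounted_feature_sum` from the initial state and a candidate open-loop action sequence, with no reward labels involved.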
Our proposed algorithm, Random Features for Model-Free Planning (RaMP), leverages an unlabelled offline dataset to learn reward-agnostic implicit models that can quickly solve new tasks with different reward functions under the same shared environment dynamics. We show the efficacy of this method on a number of simulated robotic manipulation and locomotion tasks, and highlight how RaMP provides a more general paradigm than typical generalizations of model-based or model-free reinforcement learning.
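At test time, the recipe described above amounts to two cheap operations: regress the new task's rewards onto the same random features, then recombine Q-basis predictions linearly to score candidate plans. The sketch below illustrates this with a random-shooting planner; the helper names, ridge regression, and uniform action sampling are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def fit_reward_weights(features, rewards, reg=1e-3):
    """Ridge regression r ~= phi(s, a) @ w: express the new task's
    reward as a linear combination of the fixed random features."""
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + reg * np.eye(d),
                           features.T @ rewards)

def score_action_sequences(q_basis, state, action_seqs, w):
    """Q(s, a_0:H) ~= psi(s, a_0:H) @ w: recombine Q-basis outputs
    (predicted discounted feature sums) into a Q-value per plan."""
    psi = np.stack([q_basis(state, seq) for seq in action_seqs])
    return psi @ w

def plan(q_basis, state, w, horizon=10, num_candidates=256,
         action_dim=2, seed=0):
    """Random-shooting MPC: sample open-loop action sequences, score
    them with the recombined Q-values, execute the best first action."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1, 1, size=(num_candidates, horizon, action_dim))
    scores = score_action_sequences(q_basis, state, candidates, w)
    return candidates[np.argmax(scores)][0]
```

Because the dynamics information lives entirely in `q_basis` and the task information entirely in `w`, switching tasks only requires re-running the linear regression, not any further RL training.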

1.1. RELATED WORK

Model-based RL is naturally suited to this transfer setting, explicitly learning a model of the transition dynamics and the reward function [12, 15, 11, 19, 16, 20, 21]. These models are typically learned via supervised learning on one-step transitions and are then used to extract control actions via planning [22, 23] or trajectory optimization [24, 25, 26]. The key challenge in scaling lies in the fact that model predictions are sequentially fed back into the model during sampling [27, 18, 17]. This can lead to compounding errors [17, 18, 28], which grow unfavorably with the horizon length. In contrast, our work does not require autoregressive sampling, but directly models long-term behavior, and is easier to scale to longer horizons and higher dimensions. On the other hand, model-free RL often avoids compounding errors by directly modeling either policies or Q-values [4, 29, 5, 30, 1, 7] and more easily scales to high-dimensional state spaces [1, 31, 5]. However, it entangles rewards, dynamics, and policies, making it challenging to use directly for transfer. While attempts have been made at building model-free methods that generalize across rewards, such as goal-conditioned value functions [32, 33, 34, 35] or multi-task policies [36, 37], they apply only to restricted classes of reward functions and particular training distributions. Our work aims to obtain the best of both worlds (model-based and model-free RL): learning a disentangled representation of dynamics that is independent of rewards and policies, but using a model-free algorithm for learning. Our notion of long-term dynamics is connected to the state-action occupancy measure [38, 39], often used in off-policy evaluation and importance-sampling methods in RL. These methods typically try to estimate densities or density ratios directly [14, 38, 39].
Our work simply learns the long-term accumulation of random features, without requiring any notion of normalized densities. Perhaps the work most closely related to ours is the framework of successor features, which considers transfer from a fixed set of source tasks to new target tasks [40, 41, 42, 43]. Like our work, the successor features framework leverages linearity of rewards to disentangle long-term dynamics from rewards using model-free RL. However, transfer with successor features depends critically on choosing (or learning) the right featurization, and the learned representation remains entangled with the policy. Our work leverages random features and open-loop policies to allow transfer across arbitrary policies and rewards.
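To illustrate why random features can stand in for a hand-designed featurization, the sketch below fits a nonlinear reward as a linear combination of many fixed random features, in the spirit of classical random-feature regression. The choice of random ReLU features, the specific reward function, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_features(in_dim, num_features, rng):
    """Fixed random ReLU features: phi(x) = max(0, x @ W + b)."""
    W = rng.normal(size=(in_dim, num_features))
    b = rng.normal(size=num_features)
    return lambda x: np.maximum(0.0, x @ W + b)

def reward(x):
    """A nonlinear 'reward' never seen when the features were drawn."""
    return np.sin(3 * x[:, 0]) + x[:, 1] ** 2

phi = make_random_features(in_dim=2, num_features=512, rng=rng)

# Fit linear weights w so that phi(x) @ w approximates the reward.
X = rng.uniform(-1, 1, size=(2000, 2))
w, *_ = np.linalg.lstsq(phi(X), reward(X), rcond=None)

# Held-out error of the purely linear-in-features fit.
X_test = rng.uniform(-1, 1, size=(500, 2))
err = np.mean((phi(X_test) @ w - reward(X_test)) ** 2)
```

Because a rich enough random featurization can linearly express a wide class of rewards, no task-specific feature learning is needed at transfer time, which is the property the method above relies on.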

2. BACKGROUND AND SETUP

Formalism: We consider the standard Markov decision process (MDP) as characterized by a tuple M = (S, A, T , R, γ, µ), with state space S, action space A, transition dynamics T : S × A →

