MODEL-FREE REINFORCEMENT LEARNING THAT TRANSFERS USING RANDOM FEATURES

Abstract

Reinforcement learning (RL) algorithms have the potential not only for synthesizing complex control behaviors, but also for transfer across tasks. Typical model-free RL algorithms are usually good at solving individual problems with high dimensional state-spaces or long horizons, but can struggle to transfer across tasks with different reward functions. Model-based RL algorithms, on the other hand, naturally enable transfer across different reward functions, but struggle to scale to settings with long horizons and/or high dimensional observations. In this work, we propose a new way to transfer behaviors across tasks with different reward functions, combining the benefits of model-free RL algorithms with the transferability of model-based RL. In particular, we show how a careful combination of model-free RL using randomly sampled features as reward is able to implicitly model long-horizon environment dynamics. Model-predictive control using these implicit models enables quick adaptation to problems with new reward functions, while scaling to problems with high dimensional observations and long horizons. Our method can be trained on offline datasets without reward labels, and quickly deployed on new tasks, making it more widely applicable than typical methods for both model-free and model-based RL. We validate that our proposed algorithm enables transfer across tasks in a variety of robotics and analytic domains.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have been shown to successfully synthesize complex behavior in single-task sequential decision-making problems [1, 2, 3], but more importantly have the potential for broad generalization across problems. However, many RL algorithms are deployed as specialists: they solve single tasks and are not prepared for reusing their interactions. In this work, we specifically focus on the problem of transferring information across problems where the environment dynamics are shared, but the reward function is changing. This problem setting is reflective of a number of scenarios that may be encountered in real-world settings such as robotics. For instance, in tabletop robotic manipulation, different tasks like pulling an object, pushing an object, picking it up, and pushing to different locations, all share the same transition dynamics, but involve a changing reward function. We hence ask the question: can we reuse information across these tasks in a way that scales to high dimensional, longer horizon problems? When considering how to tackle this problem, a natural possibility is to consider direct policy search [4, 5]. Typical policy search algorithms can achieve good performance for solving a single task, but entangle the dynamics and reward, in the sense that the policy one searches for is optimal for a particular reward but may be highly suboptimal in new scenarios. Other model-free RL algorithms like actor-critic methods [6, 7, 8] or Q-learning [9, 1] may exacerbate this issue, with learned Q-functions entangling dynamics, rewards, and policies. For new scenarios, an ideal algorithm should be able to disentangle and retain the elements of shared dynamics, while being able to easily substitute in new rewards. A natural fit to disentangle dynamics and rewards are model-based RL algorithms [10, 11, 12, 13, 14].
These algorithms usually learn a single-step model of transition dynamics and leverage this learned model to perform planning [15, 12, 11, 16]. These models are naturally modular and can be used to re-plan behaviors for new rewards. However, one-step dynamics models are brittle and suffer from challenges in compounding error [17, 18]. In this work, we ask: can we build reinforcement learning algorithms that disentangle dynamics, rewards, and policies for transfer across problems but retain the ability to solve problems with high dimensional observations and long horizons? In particular, we propose an algorithm that can train on large offline datasets of transitions in an environment at training time to implicitly model transition dynamics, and then quickly perform decision making on a variety of different new tasks with varying reward functions that may be encountered at test time. Specifically, we propose to model the long-term behavior of randomly chosen basis functions (often called cumulants) of the environment state and action, under open-loop control, using what we term Q-basis functions. These Q-basis functions can be easily recombined to infer the true Q-function for tasks with arbitrary rewards by simply solving a linear regression problem. Intuitively, this suggests that rather than predicting the evolution of the entire state step by step, predicting the accumulated long-term future of many random features of the state contains information equivalent to a dynamics model, thereby forming an "implicit model" that can transfer. These implicit models scale better with horizon and environment dimensionality than typical one-step dynamics models, while retaining the benefits of transferability and modularity.
Our proposed algorithm, Random Features for Model-Free Planning (RaMP), allows us to leverage an unlabelled offline dataset to learn reward-agnostic implicit models that can quickly solve new tasks involving different reward functions in the same shared environment dynamics. We show the efficacy of this method on a number of tasks for robotic manipulation and locomotion in simulation, and highlight how RaMP provides a more general paradigm than typical generalizations of model-based or model-free reinforcement learning.

1.1. RELATED WORK

Model-based RL is naturally suited for this transfer learning setting, by explicitly learning a model of the transition dynamics and the reward function [12, 15, 11, 19, 16, 20, 21]. These models are typically learned via supervised learning on one-step transitions and are then used to extract control actions via planning [22, 23] or trajectory optimization [24, 25, 26]. The key challenge in scaling lies in the fact that they sequentially feed model predictions back into the model for sampling [27, 18, 17]. This can often lead to compounding errors [17, 18, 28], which grow unfavorably with the horizon length. In contrast, our work does not require autoregressive sampling, but directly models long-term behavior, and is easier to scale to longer horizons and higher dimensions. On the other hand, model-free RL often avoids the challenge of compounding error by directly modeling either policies or Q-values [4, 29, 5, 30, 1, 7] and more easily scales to higher dimensional state spaces [1, 31, 5]. However, this entangles rewards, dynamics, and policies, making it challenging to directly use for transfer. While certain attempts have been made at building model-free methods that generalize across rewards, such as goal-conditioned value functions [32, 33, 34, 35] or multi-task policies [36, 37], they only apply to restricted classes of reward functions and particular training distributions. Our work aims to obtain the best of both worlds (model-based and model-free RL), learning a disentangled representation of dynamics that is independent of rewards and policies, but using a model-free algorithm for learning. Our notion of long-term dynamics is connected to the notion of state-action occupancy measure [38, 39], often used for off-policy evaluation and importance sampling methods in RL. These methods often try to directly estimate either densities or density ratios [14, 38, 39].
Our work simply learns the long-term accumulation of random features, without requiring any notion of normalized densities. Perhaps the most closely related work to ours is the framework of successor features, which considers transfer from a fixed set of source tasks to new target tasks [40, 41, 42, 43]. Like our work, the successor features framework leverages linearity of rewards to disentangle long-term dynamics from rewards using model-free RL. However, transfer using successor features is critically dependent on choosing (or learning) the right featurization and entangles the policy. Our work leverages random features and open-loop policies to allow for transfer across arbitrary policies and rewards.

2. BACKGROUND AND SETUP

Formalism: We consider the standard Markov decision process (MDP) as characterized by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma, \mu)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $\mathcal{T} : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, reward function $R : \mathcal{S} \times \mathcal{A} \to \Delta([-R_{\max}, R_{\max}])$, discount factor $\gamma \in [0, 1)$, and initial state distribution $\mu \in \Delta(\mathcal{S})$. The goal is to learn a policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ that maximizes the expected discounted accumulated reward, i.e., solves

$$\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{h=1}^{\infty} \gamma^{h-1} r_h\Big]$$

with $r_h := r(s_h, a_h) \sim R_{s_h, a_h} = \Pr(\cdot \mid s_h, a_h)$. Hereafter, we will refer to an MDP and a task interchangeably.

Estimating Q-functions: Given an MDP $\mathcal{M}$, one can define the state-action Q-value function under any policy $\pi$ as

$$Q_\pi(s, a) := \mathbb{E}_{a_h \sim \pi(\cdot \mid s_h),\; s_{h+1} \sim \mathcal{T}(\cdot \mid s_h, a_h)}\Big[\sum_{h=1}^{\infty} \gamma^{h-1} r_h \;\Big|\; s_1 = s,\, a_1 = a\Big],$$

which denotes the expected accumulated reward under policy $\pi$, when starting from state-action pair $(s, a)$. Similarly, one can also define the multi-step ($\tau$-step) Q-function

$$Q_\pi(s, \tilde{a}_1, \tilde{a}_2, \dots, \tilde{a}_\tau) = \mathbb{E}_{a_{\tau+h} \sim \pi(\cdot \mid s_{\tau+h}),\; s_{h+1} \sim \mathcal{T}(\cdot \mid s_h, a_h)}\Big[\sum_{h=1}^{\infty} \gamma^{h-1} r_h \;\Big|\; s_1 = s,\, a_1 = \tilde{a}_1,\, a_2 = \tilde{a}_2,\, \dots,\, a_\tau = \tilde{a}_\tau\Big].$$

One can estimate $Q_\pi$ by Monte-Carlo sampling of the trajectories under $\pi$, i.e., by solving

$$\min_{Q \in \mathcal{Q}} \; \frac{1}{N} \sum_{j=1}^{N} \Big\| Q(s, \tilde{a}^j_1, \tilde{a}^j_2, \dots, \tilde{a}^j_\tau) - \frac{1}{M} \sum_{m=1}^{M} \sum_{h=1}^{\infty} \gamma^{h-1} r^{m,j}_h \Big\|_2^2, \tag{2.1}$$

where $\mathcal{Q}$ is some function class for Q-value estimation, which in practice is some parametric function class, e.g., neural networks; $r^{m,j}_h \sim R_{s^{m,j}_h, a^{m,j}_h}$, and $(s^{m,j}_h, a^{m,j}_h)$ come from $MN$ trajectories that are generated by $N$ open-loop action sequences $\{(\tilde{a}^j_1, \tilde{a}^j_2, \dots, \tilde{a}^j_\tau)\}_{j=1}^{N}$. For each sequence there are $M$ trajectories starting from it, and following policy $\pi$ onwards, to estimate the $\tau$-step Q-function. A large body of work considers finding this Q-function using dynamic programming, but for the sake of simplicity, this work will only consider Monte-Carlo estimation. In practice, the infinite-horizon estimator in (2.1) can be hard to obtain.
We hence use a finite-horizon approximation of $Q_\pi$ (of length $H$), denoted by $Q^H_\pi$, in learning. Note that if one chooses $H = \tau$, then the $\tau$-step Q-function defined above becomes

$$Q^H_\pi(s, \tilde{a}_1, \tilde{a}_2, \dots, \tilde{a}_H) := \mathbb{E}_{s_{h+1} \sim \mathcal{T}(\cdot \mid s_h, a_h)}\Big[\sum_{h=1}^{H} \gamma^{h-1} r_h \;\Big|\; s_1 = s,\, a_1 = \tilde{a}_1,\, \dots,\, a_H = \tilde{a}_H\Big].$$

Note that in this case the Q-function does not depend on the policy $\pi$; we therefore denote it by $Q^H$. It is just the expected accumulated reward under the open-loop action sequence $(\tilde{a}_1, \tilde{a}_2, \dots, \tilde{a}_H)$. This Q-function can be used to score how "good" a sequence of actions will be, which in turn can be used for planning.
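As a concrete illustration of the open-loop, finite-horizon Q-function, the following minimal Python sketch estimates it by Monte-Carlo rollouts. The names `discounted_return`, `open_loop_q`, `toy_step`, and `toy_reward` are hypothetical and introduced only for this example; the toy environment (a one-dimensional integrator with reward equal to the negative distance from the origin) is an assumption and is not one of the domains studied in this paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_{h=1}^{H} gamma^(h-1) * r_h for a single trajectory."""
    total = 0.0
    for h, r in enumerate(rewards):  # h = 0 here corresponds to h = 1 in the text
        total += (gamma ** h) * r
    return total


def open_loop_q(state, actions, step, reward, gamma, num_rollouts=1):
    """Monte-Carlo estimate of Q^H(s, a_1, ..., a_H) for a fixed action sequence.

    `step(s, a)` samples the next state and `reward(s, a)` samples the reward;
    both are user-supplied callables (assumptions of this sketch).
    """
    returns = []
    for _ in range(num_rollouts):
        s, rewards = state, []
        for a in actions:  # the actions are fixed in advance: no policy is queried
            rewards.append(reward(s, a))
            s = step(s, a)
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)


# Hypothetical toy environment: deterministic 1-D integrator.
def toy_step(s, a):
    return s + a


def toy_reward(s, a):
    return -abs(s)


q = open_loop_q(2.0, [-1.0, -1.0, 0.0], toy_step, toy_reward, gamma=0.5)
print(q)  # -2.5, i.e. -2 + 0.5 * (-1) + 0.25 * 0
```

Because the dynamics in this toy example are deterministic, a single rollout suffices; with stochastic dynamics one would average over several rollouts via `num_rollouts`.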

2.1. PROBLEM SETUP

We consider a transfer and offline RL scenario, where we assume access to an offline dataset consisting of several episodes $\mathcal{D} = \{(s^m_h, a^m_h, s^m_{h+1})\}_{h \in [H],\, m \in [M]}$. This dataset assumes that all transitions are collected under the same transition dynamics $\mathcal{T}$, but otherwise does not require labels for rewards, and may come from multiple different behavior policies as well. Here $H$ is the length of the trajectories, which is large enough, e.g., of order $O(1/(1-\gamma))$, to approximate the infinite-horizon setting; $M$ is the total number of trajectories. The goal is to make the best use of the dataset $\mathcal{D}$, and generalize the learned experience to improve the performance on a new task $\mathcal{M}$, with the same transition dynamics $\mathcal{T}$ but arbitrary reward functions $R$. Note that unlike some related work [40, 44], we make no assumption on the reward functions of the MDPs that generate $\mathcal{D}$, i.e., these MDPs do not have to share any structure of the reward functions, e.g., being linear in some common features. In fact, the samples of the rewards that correspond to the trajectories in $\mathcal{D}$ are not even necessary. The goal of the learning problem is to pre-train on the offline dataset such that we can enable very quick (even zero-shot) adaptation to the new reward functions encountered at test time.

3. RAMP: LEARNING IMPLICIT MODELS FOR CROSS-REWARD TRANSFER WITH MODEL-FREE TECHNIQUES

In this section, we introduce our algorithm, Random Features for Model-Free Planning (RaMP), to solve the problem described in §2: learning a model of long-term dynamics that enables transfer to tasks labeled with arbitrary new rewards, while mitigating challenges with compounding error.

[Figure. Panel labels: Exploration Trajectories without Reward; Random Feature; Fit ψ to aggregated feature evolution; Small Amount of Interaction; Linear Regression; Infer; Model Predictive Control.]

We start by arguing where model-based and model-free algorithms fall short. Model-based RL approaches estimate the transition dynamics T using the data in D, and plan in the estimated model. The key advantage of this approach is that it is reward-agnostic, and has the potential to easily generalize to multiple tasks. Unfortunately, since the model outputs are fed back into the model for multi-step planning, it is subject to the compounding error of one-step dynamics models [17]. In contrast, one can resort to model-free RL approaches, e.g., Q-learning or policy optimization methods [6, 9, 1, 5, 4], to directly optimize the value of interest. These methods are less subject to the challenges of compounding error than most model-based ones. Empirically, learning neural networks to predict the Q-function (a scalar for each (s, a)) can be much easier than learning to predict the next state (which can be a high-dimensional vector, e.g., an image). However, these methods cannot be used directly to transfer across different tasks with different rewards, as they are designed to be reward-dependent. This raises the natural question: Is there a model-free approach that can mitigate the challenges of compounding error and can transfer across tasks painlessly? The key insight we advocate is that, instead of modeling the long-term accumulation of some specific reward as a Q-function, we can directly model the long-term accumulation of many random features of state-actions under arbitrary open-loop action sequences. This can effectively disentangle the transition dynamics, the reward, and the policies being evaluated, and potentially allow for transfer across tasks. Each long-term accumulation of random features is referred to as an element of a "random" Q-basis, and can be learned with simple modifications to typical model-free RL algorithms.
At training time, the offline dataset D can be used to learn a set of "random" Q-basis functions for different random features. This effectively forms an "implicit model", as it carries information about how the dynamics propagates, without being tied to any particular reward function or policy. At test time, given a new reward function, we can recombine Q-basis functions linearly to effectively approximate the true reward-specific Q-function. This inferred Q-function can then be used for planning for the new task.

3.1. OFFLINE TRAINING: LEARNING RANDOM Q-FUNCTIONS FROM UNLABELLED DATA

Given a dataset of transitions without reward labels, the goal of this phase is to model the long-term accumulation of random features under random state-action sequences. With no prior knowledge about the downstream test-time rewards, the random features being modeled must be expressive and universal in their coverage, so that any possible test-time rewards can be reconstructed from these random features by linear regression. As suggested in [45, 46, 47], random features can be powerful in representing nonlinear functions, i.e., any test-time reward function in our case, as their linear combinations. In particular, suppose we have $K$ neural networks $\phi(\cdot, \cdot; \theta_k) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ with weights $\theta_k \in \mathbb{R}^d$ and $k \in [K]$, where the $\theta_k$ are sampled i.i.d. from some distribution $p$. Sampling $K$ such weights $\theta_k$ with $k \in [K]$ yields a vector of scalar functions $[\phi(\cdot, \cdot; \theta_k)]_{k \in [K]} \in \mathbb{R}^K$ for any $(s, a)$, which can be used as random features whose accumulation through dynamics can be used to model Q-basis functions. To model the long-term accumulation of each of these random features, we note that they can be treated as reward functions in model-free RL, and the machinery of Q-functions can be reused to learn their long-term accumulation. As discussed in [48], model-free RL algorithms can be used to model the evolution of arbitrary functions (called "cumulants") of the state. Therefore, we can learn a set of $K$ Q-basis functions, with each of them corresponding to a particular random feature. We note that this definition of a Q-basis function is tied to a particular policy $\pi$ that generates the trajectory. To transfer, one needs to predict the accumulated random features under new sequences of actions, as the optimal policy for the new task is likely to not be within the span of policies seen in training.
To allow the modeling of cumulants independently of particular policies, we propose to learn open-loop Q-basis functions for each of the random features (as discussed in §2), which are policy-agnostic and can be used to search for optimal actions in new tasks. To actually learn these Q-basis functions (one for each random feature), we opt to use Monte-Carlo methods for simplicity. We generate a new dataset $\mathcal{D}_\phi$ from $\mathcal{D}$, with

$$\mathcal{D}_\phi = \Big\{\Big((s^m_1, a^m_{1:H}),\; \sum_{h \in [H]} \gamma^{h-1} \phi(s^m_h, a^m_h; \theta_k)\Big)\Big\}_{m \in [M],\, k \in [K]}.$$

Here $\sum_{h \in [H]} \gamma^{h-1} \phi(s^m_h, a^m_h; \theta_k)$ is the accumulated random feature along trajectory $m$, and we use function approximators $\psi(\cdot, \cdot; \nu_k) : \mathcal{S} \times \mathcal{A}^H \to \mathbb{R}$, for $k \in [K]$, to fit the accumulated cumulants. Specifically, we minimize the following loss

$$\min_{\{\nu_k\}_{k \in [K]}} \; \frac{1}{M} \sum_{m \in [M],\, k \in [K]} \Big(\psi(s^m_1, a^m_{1:H}; \nu_k) - \sum_{h \in [H]} \gamma^{h-1} \phi(s^m_h, a^m_h; \theta_k)\Big)^2. \tag{3.1}$$

These Q-basis functions can be recombined to approximate the Q-functions for true rewards at test time. The two key design decisions here are (1) predicting the evolution of random features, rather than one-step modeling of state, and (2) predicting the accumulated random features under open-loop action sequences, rather than a closed-loop policy.
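The offline phase can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: each random feature is taken to be a single random tanh unit, and each Q-basis function ψ is fit by regularized linear least squares on the concatenated input (s_1, a_1, ..., a_H), whereas the text leaves the function approximator open (e.g., neural networks). The function names `make_random_features`, `accumulated_features`, and `fit_q_basis` are hypothetical.

```python
import numpy as np


def make_random_features(state_dim, action_dim, num_features, seed=0):
    """Return phi(states, actions) with K outputs; feature k is tanh(theta_k . [s; a] + b_k).

    Assumption of this sketch: one random tanh unit per feature, Gaussian weights.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(state_dim + action_dim, num_features))  # i.i.d. theta_k
    bias = rng.normal(size=num_features)

    def phi(states, actions):
        x = np.concatenate([states, actions], axis=-1)
        return np.tanh(x @ theta + bias)

    return phi


def accumulated_features(phi, states, actions, gamma):
    """Regression targets of (3.1): sum_h gamma^(h-1) phi(s_h, a_h; theta_k).

    states has shape (M, H, state_dim), actions has shape (M, H, action_dim);
    the result has shape (M, K).
    """
    horizon = states.shape[1]
    discounts = gamma ** np.arange(horizon)  # gamma^(h-1) for h = 1..H
    feats = phi(states, actions)             # shape (M, H, K)
    return np.einsum("h,mhk->mk", discounts, feats)


def fit_q_basis(states, actions, targets, reg=1e-6):
    """Fit all K Q-basis functions at once with a linear model (a simplification)."""
    num_traj = states.shape[0]
    x = np.concatenate(
        [states[:, 0], actions.reshape(num_traj, -1), np.ones((num_traj, 1))], axis=1
    )
    gram = x.T @ x + reg * np.eye(x.shape[1])
    nu = np.linalg.solve(gram, x.T @ targets)  # column k plays the role of nu_k

    def psi(first_states, action_seqs):
        n = first_states.shape[0]
        z = np.concatenate(
            [first_states, action_seqs.reshape(n, -1), np.ones((n, 1))], axis=1
        )
        return z @ nu  # shape (n, K): one accumulated-feature prediction per basis

    return psi
```

Note that no reward labels enter any of these functions; the random features themselves play the role of rewards.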

3.2. ONLINE PLANNING: INFERRING Q-FUNCTIONS WITH LINEAR REGRESSION AND PLANNING WITH MODEL-PREDICTIVE CONTROL

The goal of our learned Q-basis functions is to enable transfer to new tasks with arbitrary rewards. Any reward function can be approximately expressed as a linear combination of a sufficiently expressive and expansive set of random features. Given this linear approximation, we can recover an approximation to the Q-function for the true test-time reward by recombining the random Q-basis functions linearly. Therefore, we can obtain the test-time Q-function by solving a simple linear regression problem. This inferred Q-function can then be used to obtain an optimal sequence of actions through planning.
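To make the recombination argument explicit, the short derivation below shows how an error in the linear reward fit propagates to the Q-function. It is a sketch under assumptions that are not part of the paragraph above: $w_k$ denotes the regression coefficients, $\psi^\star_k(s_1, a_{1:H}) := \mathbb{E}\big[\sum_{h=1}^{H} \gamma^{h-1} \phi(s_h, a_h; \theta_k)\big]$ denotes the exact accumulated random feature (the quantity that a learned Q-basis function approximates), $r(s, a)$ denotes the mean reward, and the reward fit is assumed to be uniformly accurate, $|r(s, a) - \sum_k w_k \phi(s, a; \theta_k)| \le \varepsilon$ for all $(s, a)$.

```latex
\begin{align*}
Q^H(s_1, a_{1:H})
  &= \mathbb{E}\Big[\sum_{h=1}^{H} \gamma^{h-1} r(s_h, a_h)\Big] \\
  &= \sum_{k \in [K]} w_k\, \mathbb{E}\Big[\sum_{h=1}^{H} \gamma^{h-1} \phi(s_h, a_h; \theta_k)\Big]
     + \underbrace{\mathbb{E}\Big[\sum_{h=1}^{H} \gamma^{h-1}
       \Big(r(s_h, a_h) - \sum_{k \in [K]} w_k \phi(s_h, a_h; \theta_k)\Big)\Big]}_{=:\,\Delta} \\
  &= \sum_{k \in [K]} w_k\, \psi^\star_k(s_1, a_{1:H}) + \Delta,
  \qquad
  |\Delta| \le \varepsilon \sum_{h=1}^{H} \gamma^{h-1}
           = \varepsilon\, \frac{1 - \gamma^{H}}{1 - \gamma}.
\end{align*}
```

Under these assumptions, the reward-fitting error is amplified by a factor of at most $1/(1-\gamma)$; any remaining error comes from how well the learned Q-basis functions match $\psi^\star_k$.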

3.2.1. REWARD FUNCTION FITTING WITH RANDOMIZED FEATURES

We first learn how to express the reward function for the new task as a linear combination of the random features. This can be done by solving a linear regression problem to find the coefficient vector $w = [w_1, \dots, w_K]^\top$ that approximates the new task's reward function as a linear combination of the random features. Specifically, we minimize the following loss

$$w^* = \operatorname*{argmin}_{w} \; \frac{1}{MH} \sum_{h \in [H],\, m \in [M]} \Big(r(s^m_h, a^m_h) - \sum_{k \in [K]} w_k \phi(s^m_h, a^m_h; \theta_k)\Big)^2 + \lambda \|w\|_2^2, \tag{3.2}$$

where $\lambda \ge 0$ is the regularization coefficient, and $r(s^m_h, a^m_h) \sim R_{s^m_h, a^m_h}$. Due to the use of random features, Eq. (3.2) is a ridge regression problem, and can be solved efficiently. In case the reward labels are being obtained on the fly during online data collection, we can leverage an online least squares algorithm [49] to continually improve our estimate of $w$ without re-computing the regression result from scratch as the number of samples grows. Given these weights, it is easy to estimate an approximate open-loop Q-function for the true reward on the new task by linearly combining the Q-basis functions learned in the offline training phase, $\{\psi(\cdot, \cdot; \nu^*_k)\}_{k \in [K]}$, according to the same coefficient vector $w^*$. This follows from the additive nature of reward and linearity of expectation. In particular, if the relation $r(s, a) = \sum_{k \in [K]} w^*_k \phi(s, a; \theta_k)$ holds approximately, which will be the case for large enough $K$ and rich enough $\phi$, then the approximate Q-function for the true test-time reward under the sequence $\{a_1, \dots, a_H\}$ satisfies

$$Q^H(s_1, a_{1:H}) := \mathbb{E}_{s_{h+1} \sim \mathcal{T}(\cdot \mid s_h, a_h)}\Big[\sum_{h \in [H]} \gamma^{h-1} R_{s_h, a_h}\Big] \approx \sum_{k \in [K]} w^*_k\, \psi(s_1, a_{1:H}; \nu^*_k),$$

where $\{w^*_k\}_{k \in [K]}$ is the solution to the regression problem (Eq. (3.2)) and $\{\nu^*_k\}_{k \in [K]}$ is the solution to the Q-basis fitting problem (Eq. (3.1)).
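The ridge regression above has a closed-form solution, which the following NumPy sketch illustrates together with the linear recombination of Q-basis predictions. The function names `fit_reward_weights` and `infer_q` are hypothetical; the array `features` is assumed to hold the random features evaluated on the reward-labelled samples (one row per sample), and `q_basis_values` is assumed to hold the Q-basis predictions for a batch of candidate action sequences.

```python
import numpy as np


def fit_reward_weights(features, rewards, lam=1e-3):
    """Closed-form minimizer of (1/n) * ||rewards - features @ w||^2 + lam * ||w||^2.

    features: shape (n, K), random features phi(s, a; theta_k) on n = M * H samples.
    rewards:  shape (n,), observed rewards on the same samples.
    Setting the gradient to zero gives (features^T features / n + lam * I) w = features^T rewards / n.
    """
    n, num_features = features.shape
    lhs = features.T @ features / n + lam * np.eye(num_features)
    rhs = features.T @ rewards / n
    return np.linalg.solve(lhs, rhs)


def infer_q(q_basis_values, w):
    """Recombine Q-basis predictions of shape (N, K) into N task-specific Q-value estimates."""
    return q_basis_values @ w
```

Since only the K-dimensional vector w depends on the new reward, adapting to a new task in this sketch amounts to solving one K-by-K linear system; the Q-basis functions themselves are left untouched.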

3.2.2. PLANNING WITH MODEL-PREDICTIVE CONTROL

To obtain the optimal sequence of actions, we can use the inferred approximate Q-function for the true reward, $Q^H(s_1, a_{1:H})$, for online planning at each time $t$ in the new task: at state $s_t$, we conduct standard model-predictive control with random shooting, i.e., we randomly generate $N$ sequences of actions $\{a^n_t, \dots, a^n_{t+H-1}\}_{n \in [N]}$ and find the action sequence with the maximum Q-value,

$$n^*_t \in \operatorname*{argmax}_{n \in [N]} \; \sum_{k \in [K]} w^*_k\, \psi(s_t, a^n_{t:t+H-1}; \nu^*_k).$$

We then execute the first action $a^{n^*_t}_t$ of the sequence $n^*_t$, observe the new state $s_{t+1}$, and replan. Our algorithm is summarized in Algorithm 1. We refer readers to Appendix D for a detailed connection of our proposed method to existing work and Appendix A for detailed pseudocode.
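One planning step of random shooting can be sketched as follows. This is an illustrative sketch, not the detailed pseudocode of Appendix A: the function name `random_shooting_action` is hypothetical, candidate actions are assumed to be drawn uniformly from a box `[low, high]` (the text only says they are generated randomly), and `q_fn` is an assumed callable standing in for the inferred Q-function, i.e., the weighted sum of Q-basis predictions for a batch of candidate sequences.

```python
import numpy as np


def random_shooting_action(q_fn, state, horizon, action_dim, num_samples, rng,
                           low=-1.0, high=1.0):
    """Sample N open-loop action sequences, score them with q_fn, return the best.

    q_fn(state, seqs) must return one score per sequence for seqs of shape
    (N, horizon, action_dim). Returns (first action of the best sequence, best sequence).
    """
    seqs = rng.uniform(low, high, size=(num_samples, horizon, action_dim))
    scores = q_fn(state, seqs)      # shape (N,)
    best = int(np.argmax(scores))   # index n*_t of the highest-scoring sequence
    return seqs[best, 0], seqs[best]
```

In a control loop, only the returned first action would be executed; the function is then called again from the next observed state, which is the replanning step described above.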

3.3. THEORETICAL JUSTIFICATIONS

We now provide some theoretical justifications for the methodology we adopt. To avoid unnecessary nomenclature of measures and norms in infinite dimensions, in this section we consider the case where $\mathcal{S}$ and $\mathcal{A}$ are discrete (but can be enormously large). Due to space limitations, we present an abridged version of the results below, and defer the detailed versions and proofs to §C. We first state the following result on the expressiveness of random cumulants.

Theorem 3.1 (Q-function approximation; Informal). Under standard coverage and sampling assumptions on the offline dataset $\mathcal{D}$, and standard assumptions on the boundedness and continuity of the random features $\phi(s, a; \theta)$, it follows that with horizon length $H = \Theta\big(\frac{\log(R_{\max}/\epsilon)}{1-\gamma}\big)$ and $M = \Theta\big(\frac{1}{(1-\gamma)^3 \epsilon^4}\big)$ episodes in dataset $\mathcal{D}$, and with $K = \Omega\big((1-\gamma)^{-2} \epsilon^{-2}\big)$ random features, we have that for any given reward function $R$ and any policy $\pi$,

$$\big\| Q^H_\pi(w^*) - Q_\pi \big\|_\infty \le O(\epsilon) + O\Big(\inf_{f \in \mathcal{H}} \mathcal{E}(f)\Big)$$

with high probability, where for each $(s, a)$, $Q^H_\pi(s, a; w^*)$ is defined as

$$Q^H_\pi(s, a; w^*) := \mathbb{E}\Big[\sum_{h=1}^{H} \gamma^{h-1} \sum_{k \in [K]} w^*_k \phi(s_h, a_h; \theta_k) \;\Big|\; s_1 = s,\, a_1 = a\Big],$$

and can be estimated from the offline dataset $\mathcal{D}$; $\inf_{f \in \mathcal{H}} \mathcal{E}(f)$ is the infimum expected risk over the function class $\mathcal{H}$ induced by $\phi$.

Theorem 3.1 is an informal statement of the results in §C.2, which specifies the number of random features, the horizon length per episode, and the number of episodes needed to approximate $Q_\pi$ accurately by using the data in $\mathcal{D}$, under any given reward function $R$ and policy $\pi$. Note that the number of random features is not excessive and is polynomial in the problem parameters. We also note that the results can be improved under stronger assumptions on the sampling distribution $p$ and the kernel function classes [47, 50]. Next, we justify the use of open-loop Q-functions in planning in environments with deterministic transitions, which covers all the environments we evaluate empirically later.
Recall that for any given reward $R$, $Q_\pi$ denotes the Q-function under policy $\pi$. Note that with a slight abuse of notation, $Q_\pi$ can also be the multi-step Q-function (see definition in §2), and the meaning should be clear from the input, i.e., whether it is $Q_\pi(s, a)$ or $Q_\pi(s, a_1, \dots, a_H)$.

Theorem 3.2. Let $\Pi$ be some subclass of Markov stationary policies, i.e., for any $\pi \in \Pi$, $\pi : \mathcal{S} \to \Delta(\mathcal{A})$. Suppose the transition dynamics $\mathcal{T}$ is deterministic. For any given reward $R$, denote the $H$-step policy obtained from $H$-step open-loop policy improvement over $\Pi$ as $\pi'_H : \mathcal{S} \to \mathcal{A}^H$, defined as

$$\pi'_H(s) \in \operatorname*{argmax}_{(a_1, \dots, a_H) \in \mathcal{A}^H} \; \max_{\pi \in \Pi} Q_\pi(s, a_1, \dots, a_H)$$

for all $s \in \mathcal{S}$. Let $V_{\pi'_H}$ denote the value function under $\pi'_H$. Then, for all $s \in \mathcal{S}$,

$$V_{\pi'_H}(s) \ge \max_{a_{1:H}} \max_{\pi \in \Pi} Q_\pi(s, a_1, \dots, a_H) \ge \max_{a} \max_{\pi \in \Pi} Q_\pi(s, a).$$

The proof of Theorem 3.2 can be found in §C.2. The result can be viewed as a generalization of the generalized policy iteration result in [40] to multi-step open-loop policies. Taking $\Pi$ to be the set of policies that generate the data, the result shows that the value function of the greedy open-loop policy improves over all the possible $H$-step open-loop policies, with the policy after step $H$ being any policy in $\Pi$. Moreover, the value function of $\pi'_H$ also improves over all one-step policies if the policy from the second step onwards is any policy in $\Pi$. This is due to the fact that $\Pi$ (coming from data) might be a much more restricted policy class than the set of all open-loop sequences $a_{1:H}$.

4. EXPERIMENTAL EVALUATION

In this section, we aim to answer the following research questions: (1) Does RaMP allow for effective transfer of behaviors across tasks with varying rewards but shared dynamics? (2) Does RaMP scale to domains with high dimensional observation spaces and longer horizons? (3) Does RaMP scale to domains with high dimensional action spaces? (4) Which design decisions in RaMP enable better transfer and scaling?

4.1. EXPERIMENT SETUP

Across several domains, we evaluate the ability of RaMP to leverage the knowledge of shared dynamics from an offline dataset to quickly solve new tasks with arbitrary rewards. Offline Dataset Construction: For each domain, we have an offline dataset collected by a behavior policy as described in Appendix B.2. Typically this behavior policy is a mixture of noisy policies accomplishing different objectives in each domain. Although RaMP and other model-based methods do not require knowledge of any reward from the offline dataset and simply require transitions, other baseline comparisons will require privileged information. Baseline comparison methods like model-free RL and successor features require the provision of a set of training objectives, as well as rewards labeled for these objectives on state-actions from the offline dataset. We call such objectives 'offline objectives' and a dataset annotated with these offline objectives and rewards a privileged dataset. Test-time adaptation: At test-time, we select a novel reward function for online adaptation, referred to as 'online objective' below. The online objective may correspond to rewards conditioned on different goals or even arbitrary rewards, depending on the domain. Importantly, the online objective need not be drawn from the same distribution as the privileged offline objectives above. Given this problem setup, we compare RaMP with a variety of baselines. (1) MBPO [13] is a model-based reinforcement learning method that learns a standard one-step dynamics model and uses actor-critic methods to plan in the model. We pre-train the dynamics model for MBPO on the offline dataset before running the full algorithm on the testing environment. (2) Successor features (SF) [40] is a framework for transfer learning in RL as described in Sec. 1.1.
SF typically assumes access to a set of policies towards different goals along with a learned featurization, so we provide it with the privileged dataset to learn a set of policies corresponding to the offline objectives using offline reinforcement learning [51]. We also learn successor features with the privileged dataset [40]. (3) CQL [51]: As an oracle comparison, we compare with a goal-conditioned variant of an offline RL algorithm (CQL). CQL is a model-free offline RL algorithm that learns policies from offline data. While model-free offline RL naturally struggles to adapt to arbitrarily changing rewards, we instead afford CQL additional privileges by providing it with information about the goal at both training and testing time. CQL is then trained on the distribution of training goals on the offline dataset, and finetuned on the new goal provided at test time. In this sense, the CQL comparison is assuming access to more information than RaMP. Each method is benchmarked on each domain with 9 seeds.

4.2. TRANSFER TO NOVEL REWARDS

We first evaluate the ability of RaMP to learn from an offline dataset and quickly adapt to novel test rewards in 4 robotic manipulation environments from Meta-World [52]. We consider skills like reaching a target across the wall, opening a door, turning on a faucet, and pressing a button while avoiding obstacles, which are challenging for typical model-based RL algorithms (Fig. 2). Each domain features 50 different possible goal configurations, each associated with a different reward but the same dynamics. The privileged offline objectives consist of 25 goal configurations as described in Sec. 4.1. The test-time reward functions are drawn from the remaining 25 "out-of-distribution" reward functions. We refer the reader to Appendix B.1 for details of this setup. As shown in Fig. 3, our method adapts to the test reward most quickly across all four domains. MBPO slowly catches up with our performance with more samples, since it still needs to learn the Q-function from scratch even with the dynamics branch trained. In multiple environments, successor features barely transfer to the online objective, as the method entangles policies that are not close to those needed for the online objective. Goal-conditioned CQL performs poorly in all tasks, as it has a hard time generalizing to out-of-distribution goals. In comparison, RaMP is able to deal with arbitrary sets of test-time rewards, since it does not depend on the reward distribution at training time.

4.3. SCALING TO TASKS WITH LONGER HORIZONS

We further evaluate the ability of our method to scale to tasks with longer horizons. We consider locomotion domains such as the Hopper environment from OpenAI Gym [53]. We chose the offline objectives to be running forward at different velocities. The online objectives for adaptation correspond to novel skills such as standing, sprinting, jumping, or running backward. Among them, standing and sprinting are goal-conditioned objectives that correspond to running forward at zero and maximum speed, while jumping and running backward have drastically different objectives that are difficult to express as parametric "goals". Therefore, goal-conditioned methods like CQL are not applicable to jumping and running backward. As shown in Fig. 3, our method maintains the highest performance when adapting to drastically different online objectives, as it is designed to make no assumption about reward in the offline dataset, while avoiding compounding errors by directly modeling accumulated random features. MBPO fails to match the performance of RaMP since higher dimensional observations and longer horizons increase the compounding error of model-based methods. We note that SF performs reasonably well, likely because the method also reduces the compounding error compared to MBPO. Furthermore, its featurization is trained with privileged data and thus still captures useful information for the online objectives. In Appendix B.4, we further test our method on environments with even higher dimensional observations, such as image observations. We refer the reader to Appendix B.1 for further details.

4.4. SCALING TO HIGH DIMENSIONAL STATE-ACTION SPACES

To understand whether RaMP can scale to higher dimensional state-action spaces, we consider a dexterous manipulation domain (referred to as the D'Claw domain in Fig. 2). This domain has a 9-DoF action space controlling each of the joints of the hand as well as a 16-dimensional state space including object position. The offline dataset is collected by moving the object to different orientations, and the test-time rewards correspond to moving the object to new orientations (as described in Appendix B.1). Fig. 3 shows that both Q-estimation and planning with model-predictive control remain effective when the action space is large.

4.5. ABLATION OF DESIGN CHOICES

To understand what design decisions in RaMP enable better transfer and scaling, we conduct ablation studies on various domains, including an analytical 2D point goal-reaching environment and its variants (described in Appendix B.1), as well as the classic InvertedDoublePendulum domain and meta-world reaching. We report an extensive set of ablations in Appendix B. 

Reduction of compounding error with open-loop Q functions

We hypothesize that our method does not suffer from compounding errors in the same way that feedforward dynamics models do. In Table 1, we compare the approximation error of truncated Q values computed with (1) open-loop Q functions obtained as a linear combination of random cumulants (Ours), and (2) rollouts of a feedforward dynamics model (MBRL). We train the methods on offline data and evaluate on data from a novel task at test time. Note that this setting is analogous to performing policy evaluation on the behavioral policy induced by an offline dataset. We perform the comparison on one environment with high action dimension (Hopper) and one with chaotic dynamics (Pendulum). As shown in Table 1, our method attains lower approximation error than feedforward dynamics models.

Effect of different types of featurization

We experiment with three choices of projections: random features parametrized by a deep neural network, random features parametrized by a Gaussian matrix, and polynomial features of state and action up to the second order. We evaluate these choices on Point and Metaworld Reach in Table 2. We see that a linear combination of NN-parametrized random features approximates the true reward well on all evaluated tasks. Polynomial features perform well on environments with simple rewards that are linear in the polynomial basis, but struggle as the reward becomes more complex, while Gaussian features are rarely expressive enough.
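To make the three featurizations concrete, the following is a minimal standard-library sketch, not the implementation used in the experiments. The function names, the tanh activation, the absence of bias terms, and the initialization scale are all assumptions made for illustration; only the overall structure (second-order polynomial features, a Gaussian random projection, and one small randomly initialized MLP per feature dimension) follows the text.

```python
import itertools
import math
import random


def polynomial_features(x):
    """Monomials of x up to second order: 1, x_i, and x_i * x_j for i <= j."""
    feats = [1.0] + list(x)
    feats += [xi * xj for xi, xj in itertools.combinations_with_replacement(x, 2)]
    return feats


def make_gaussian_features(in_dim, num_features, rng):
    """Linear random projection with i.i.d. standard Gaussian entries."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(num_features)]
    return lambda x: [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_mlp_features(in_dim, num_features, rng, hidden=32):
    """One small, randomly initialized, frozen MLP per feature dimension.

    Assumptions for illustration: tanh activations, no biases, 1/sqrt(fan_in) scale.
    """
    def random_layer(n_in, n_out):
        scale = 1.0 / math.sqrt(n_in)
        return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]

    nets = [(random_layer(in_dim, hidden), random_layer(hidden, hidden),
             random_layer(hidden, 1)) for _ in range(num_features)]

    def dense(W, x, activation):
        return [activation(sum(w * xi for w, xi in zip(row, x))) for row in W]

    def phi(x):
        out = []
        for W1, W2, W3 in nets:
            h = dense(W1, x, math.tanh)
            h = dense(W2, h, math.tanh)
            out.append(dense(W3, h, lambda z: z)[0])
        return out

    return phi


rng = random.Random(0)
state_action = [0.1, -0.4, 0.7]  # hypothetical concatenated (state, action) vector
poly = polynomial_features(state_action)
gauss = make_gaussian_features(3, 16, rng)(state_action)
mlp = make_mlp_features(3, 16, rng)(state_action)
print(len(poly), len(gauss), len(mlp))
```

Note that the Gaussian projection is linear in its input, so a linear combination of its outputs can only represent linear rewards, which is one plausible reading of why such features are rarely expressive enough.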

5. DISCUSSION

In this work, we introduce RaMP, a method for leveraging diverse prior offline data to learn models of long-horizon dynamics behavior, while being able to naturally transfer across tasks with different reward functions. To do so, we combine the best elements of model-based and model-free reinforcement learning. By learning the long-term evolution of random features under open-loop policies, we are able to disentangle dynamics, rewards, and policies. We show how this technique allows us to learn behavior that naturally transfers across tasks, even under misspecification of reward functions. Across a number of simulated robotics and control domains, RaMP achieves better transfer than the baselines. In future work, we hope to explore how to combine RaMP with more powerful planning methods like [54] and dynamic programming techniques for learning Q-basis functions [6].

Supplementary Materials for "Model-free Reinforcement Learning that Transfers using Random Features"

A ALGORITHM PSEUDOCODE

[…] $D_\phi = \Big\{ \Big( (s_1^m, a_{1:H}^m), \ \sum_{h \in [H]} \gamma^{h-1} \phi(s_h^m, a_h^m; \theta_k) \Big) \Big\}_{m \in [M], k \in [K]}$.

4: Fit random Q-basis functions $\psi(\cdot, \cdot; \nu_k) : S \times A^H \to \mathbb{R}$ for $k \in [K]$ by minimizing the loss over the dataset $D_\phi$:

$$\{\nu_k^*\}_{k \in [K]} \in \arg\min_{\{\nu_k\}_{k \in [K]}} \frac{1}{M} \sum_{m \in [M], k \in [K]} \Big( \psi(s_1^m, a_{1:H}^m; \nu_k) - \sum_{h \in [H]} \gamma^{h-1} \phi(s_h^m, a_h^m; \theta_k) \Big)^2.$$

5: Online Planning Phase:

6: Fit the testing task's reward function $r(\cdot, \cdot)$ with linear regression on random features:

$$w^* \in \arg\min_{w} \frac{1}{MH} \sum_{h \in [H], m \in [M]} \Big( r(s_h^m, a_h^m) - \sum_{k \in [K]} w_k \phi(s_h^m, a_h^m; \theta_k) \Big)^2 + \lambda \|w\|_2^2,$$

where $r(s_h^m, a_h^m) \sim R_{s_h^m, a_h^m}$.

7: Sample $s_1 \sim \mu_0$

8: for time index $t = 1, \cdots$ do

9: Randomly generate $N$ sequences of actions $\{a_1^n, \cdots, a_H^n\}_{n \in [N]}$

10: Find the best sequence such that $n_t^* \in \arg\max_{n \in [N]} \sum_{k \in [K]} w_k^* \psi(s_t, \ldots$
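The online planning steps above (sample N action sequences, score each with the weighted Q-basis functions, keep the best) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: `q_basis` is a hypothetical callable standing in for the trained networks $\psi(\cdot, \cdot; \nu_k)$, actions are sampled uniformly from $[-1, 1]$, and executing only the first action before replanning is the usual model-predictive control convention, assumed here.

```python
import random


def plan_random_shooting(state, w, q_basis, num_sequences, horizon, action_dim, rng):
    """Return the sampled action sequence with the largest predicted return.

    q_basis(state, actions) is a hypothetical callable returning the K values
    psi(state, actions; nu_k); w holds the K regression weights w*_k.
    """
    best_score, best_actions = float("-inf"), None
    for _ in range(num_sequences):
        # Assumption: actions are bounded and sampled uniformly in [-1, 1].
        actions = [[rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
                   for _ in range(horizon)]
        psi = q_basis(state, actions)
        score = sum(wk * pk for wk, pk in zip(w, psi))
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions, best_score


def toy_q_basis(state, actions):
    """Toy stand-in for trained Q-basis networks (purely illustrative)."""
    total = sum(a[0] for a in actions)
    return [-(state + total - 1.0) ** 2, -sum(a[0] ** 2 for a in actions)]


rng = random.Random(0)
actions, score = plan_random_shooting(
    state=0.0, w=[1.0, 0.1], q_basis=toy_q_basis,
    num_sequences=1024, horizon=4, action_dim=1, rng=rng)
first_action = actions[0]  # assumed MPC convention: execute this, then replan
```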

B ADDITIONAL EXPERIMENTS AND SETUP DETAILS

In this section, we provide more details of the experiments, including more detailed setup and supplementary results.

B.1 DESCRIPTION OF ENVIRONMENTS

We describe the details of all environments used, including observation space, action space, reward, offline / online objectives, and dataset collection.

Meta-World

All our meta-world [52] domains share the standard meta-world observation, which includes the gripper location and the object locations of all possible objects involved in the Metaworld benchmark. Although the observation space has 39 dimensions, each domain only uses one or two objects, so only 7 dimensions change in each domain we chose. For pixel observation variants of each domain, we concatenate two 84 × 84 × 3 RGB images from two views, with a resulting observation dimension of 42336. Each domain has a 4-dimensional action space, corresponding to the delta movement of the end-effector in the workspace along with the delta movement of the gripper finger. Metaworld provides a set of 50 goal configurations for each domain. We collect the offline dataset following the procedure described in Sec. B.2. The online objective is chosen to be a novel configuration that is not any of the 50 offline goal configurations. To create the privileged dataset, we choose 25 of the goal configurations as offline objectives. These chosen configurations are the 25 goals furthest from the online objective in Euclidean distance. We evenly annotate the offline dataset with rewards for each of these goals to form a privileged dataset such that the online objective is out of the distribution of the offline objectives. For different configurations of the same domain, since the object locations are part of the observation and the goal configuration is not, the dynamics are the same. We now describe the objectives of all meta-world domains used, including those used in the appendix.

1. Reach across Wall

The objective is to reach a target across a wall. The location of the target is randomized for each goal configuration.

2. Open Door

The objective is to open the door to a certain angle by putting the end-effector between the door handle and the door surface. For each configuration, the location of the door is randomized.

3. Turn on Faucet

The objective is to turn on a faucet by rotating it along an axis. For each goal configuration, the location of the faucet is randomized.

4. Press Button

The objective is to press a button horizontally. The location of the button is randomized for each goal configuration.

Hopper

Hopper is an environment with a higher dimensional observation of 11 dimensions and an action space dimension of 3. Hopper is a locomotion environment that requires long-horizon reasoning, since a wrong action will make it fall down and touch the ground only after some steps. The objective of the original Hopper environment in OpenAI Gym is to train it to run forward. To analyze the performance of CQL and SF on Hopper, we modify the objective to a goal-conditioned variant such that the agent is trained to follow a certain velocity specified by its goal. Similar modifications are common in meta reinforcement learning, such as in [55]. We sample a set of 50 goal velocities between 0 and 1.0 as offline objectives and collect data in the same way as for the Metaworld environments. For online transfer, we choose four domains with four different online objectives. All the variant domains of Hopper share the same dynamics.

1. Hopper Stand

The objective is to stand still and upright. This online objective is on the boundary of the offline objectives' range.

2. Hopper Sprint

The objective is to run forward at a speed of 1.5, which is out of the distribution of offline objectives ranging from 0 to 1.0. The reward function remains the same as that for offline objectives. Only the goal changes.

3. Hopper Backward

The objective is to run backward at a target speed of 0.5, which is in the opposite direction of the offline objective. Falling down to the ground is penalized.

4. Hopper Jump

The objective is to jump to a height of 1.5, without moving forward or backward. Such a height is typically not achievable when running forward, so this objective is drastically different from the offline objectives.

D'Claw

The D'Claw environment is a dexterous manipulation environment with 24 dimensions of observation space and 9 dimensions of action space. The 9 dimensions correspond to the 9 joints of a robot hand with 3 fingers and 3 joints on each finger. The objective of the environment is to control the hand to rotate a tripod to a certain angle. Angular distance to the target angle is used as the offline objective. To increase the degrees of freedom and make the environment more challenging, we allow the tripod to not only rotate but also translate freely on the 2D plane. The initial rotation and position of the tripod are randomized upon every episode. We collect the offline dataset in the same way as in meta-world, training on 50 offline objectives and using ϵ-greedy to collect rollouts. At test time we choose a new objective angle and annotate the rewards of the privileged dataset in the same way as for the goal-conditioned Metaworld environments.

Analytical 2D Point

Point is a 2D point goal-reaching environment with linear dynamics. The reward is defined as the distance to goal minus an action penalty. The offline objectives are negative distances to the randomly selected goals on the 2D plane. The online objectives are novel goals on the plane. Since we are not evaluating CQL and SF on this environment, we do not generate the privileged dataset for 2D Point.

Point Perturbed

Point Perturbed shares the same linear dynamics as Point, but features unsafe regions with negative rewards or local maxima with small positive rewards. These perturbations represent out-of-distribution test objectives that cannot be well approximated by a low-dimensional feature vector. Note that the added perturbations mean that Point Perturbed is no longer an instance of the analytical 2D Point environment with a different goal. Instead, it features online objectives that are in a completely different class from those of 2D Point.

B.2 DESCRIPTION OF ALGORITHM TRAINING DETAILS

For each domain, we first train 50 policies with SAC [7]. Each policy is trained towards some offline objective of the domain described in Sec. B.1 for 50000 steps. We then use an ϵ-greedy version of each trained policy to collect 32000 data points for each domain per offline objective. We choose ϵ = 0.5. Such a procedure ensures the dataset has reasonable coverage of the entire state-action space. We note that training these policies is fully optional, since RaMP only requires trajectories without rewards. Datasets collected via intrinsic rewards like curiosity would suffice.

We choose the random feature dimension to be 2048. Each dimension of the random feature is extracted by feeding the state-action tuple to a randomly initialized MLP with 2 hidden layers of size 32. There are therefore 2048 independent, small random MLPs to extract ϕ. All state-action tuples are projected to the reward basis ϕ with these MLPs. During the offline training phase, we ensemble 8 instances of an MLP with 2 hidden layers of size 4096 and train the ψ network following Sec. 3.1. We train the ψ network with a learning rate of $3 \times 10^{-4}$ on the offline dataset for 4 epochs, with a γ decay of 0.99 and batch size 128. We choose the horizon H to be 10 for the meta-world and D'Claw environments and 32 for the Hopper environments.

During the online adaptation phase, we first do random exploration for 2457 steps to collect enough data points for linear regression. When doing linear regression, we always concatenate a bias dimension to ψ. For each MPC rollout, we randomly sample 1024 action sequences. We penalize the predicted reward by 0.16 times the variance of the predictions across the 8 ensemble members. Since online least squares makes recomputing the regression weights fast, we update the weight vector at every step after the initial random exploration is finished.

Due to the page limit, we omitted the plot for Hopper Stand in Fig. 3. Here we provide additional results of RaMP and the baselines for it in Fig. 4. The result is consistent with our analysis in the main paper. RaMP outperforms the baselines, just as in the other Hopper variants. One major difference here is that CQL performs well on Hopper Stand. This is likely because the online objective of Hopper Stand is at the boundary of the offline objectives, as described in Sec. B.1. Given that the offline objectives are running at target velocities, CQL likely learns to not fall down even if the online objective is out of distribution. By not falling down alone, CQL is capable of maintaining a good reward, as seen in this case.
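The variance-penalized scoring of candidate action sequences described in Sec. B.2 can be illustrated with a short sketch. The helper names are hypothetical, and the use of the population variance across ensemble members is an assumption, since the text does not say which variance estimator is used; only the penalty coefficient 0.16 and the ensemble size 8 come from the text.

```python
import statistics


def penalized_score(ensemble_predictions, penalty=0.16):
    """Mean ensemble prediction minus `penalty` times the ensemble variance.

    Assumption: population variance over the ensemble members.
    """
    mean = statistics.mean(ensemble_predictions)
    var = statistics.pvariance(ensemble_predictions)
    return mean - penalty * var


def select_sequence(predictions_per_sequence, penalty=0.16):
    """Pick the candidate action sequence with the best penalized score."""
    scores = [penalized_score(p, penalty) for p in predictions_per_sequence]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Two hypothetical candidates, each scored by an ensemble of 8 predictors.
preds = [
    [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.0],    # members agree
    [3.0, -1.0, 2.5, -0.5, 3.5, -1.5, 2.0, 0.4],   # higher mean, members disagree
]
best, scores = select_sequence(preds)
best_unpenalized, _ = select_sequence(preds, penalty=0.0)
```

In this toy example the second candidate has the higher mean prediction, but the penalty makes the planner prefer the first candidate, on which the ensemble members agree.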

B.4 SCALING TO HIGH-DIMENSIONAL PIXEL OBSERVATION

In Sec. 4.3, we evaluate RaMP's ability to scale to environments with high dimensional observations. In this section, we go a step further by significantly increasing the dimension of the observation space to 42336, as described in Sec. B.1. We use a CNN encoder following the architecture in [56], followed by a 2-layer MLP, as the random feature network. Both the CNN and MLP layers are randomly initialized. The action is projected and concatenated to the first layer of the MLP so that the random feature is still conditioned on both action and observation. We compare our method against Dreamer [19], a state-of-the-art model-based learning method optimized for pixel observations. Similar to MBPO, we pre-train Dreamer's dynamics branch with offline data before the online phase. As shown in Fig. 5, our method is able to achieve a similar level of performance as Dreamer in two meta-world environments, Pixel Reach and Pixel Faucet Open. RaMP does not see a significant return drop relative to the variant of the environment with state observations. Given the efficacy of Dreamer, the result still shows that RaMP's performance can scale up to pixel observations. However, Dreamer is able to outperform RaMP significantly in an environment like Pixel Door Open. This is likely because random features capture changes in input space rather than in reward space. Pixel observations can add a lot of noise to random features, while important semantic features may not correspond to a big change in pixel space. We note that instead of using random convolution layers, we could use pre-trained encoders to achieve better semantic feature extraction and significantly improve the quality of random features. This is beyond the scope of our work and we leave it for future work.

B.5 ADDITIONAL ABLATIONS

Our method builds on the assumption that a linear combination of high dimensional random features can approximate arbitrary reward functions. In the following ablation experiments, we validate this assumption both quantitatively and qualitatively.

Effect of random feature dimension

We evaluate our method on Point, Point Perturbed, and Metaworld Reach using {128, 256, 512, 1024, 2048, 4096} random features. Results are summarized in Table 3. We find that performance degrades with smaller random feature dimensions because the features are unable to linearly approximate the true reward. On the other hand, Fig. 6 shows that high dimensional random features experience slower adaptation.

Scaling with state dimension

In Table 4 we evaluate our method on point reaching environments with 2, 3, and 4 state dimensions. All three environments feature distance-to-goal minus action penalty as the reward. We compute the return error stemming from linear regression as well as the Q error stemming from both linear regression and function approximation. We see an increase in both return error and Q error with higher state dimensions, but the overall approximation errors remain reasonably low.

Nonlinear approximation capability

In Fig. 8, we visualize the truncated Q value obtained from a linear combination of the random cumulants and compare it to the ground truth Q value approximated by Monte-Carlo sampling. We perform the comparison with the Point and Point Perturbed environments. Our method provides an accurate estimate of the Q value even in the face of out-of-distribution and highly nonlinear rewards.

Finally, we compare the performance of RaMP to baselines on in-distribution online objectives. Our method is designed to make no assumptions about test objectives during policy transfer. As shown in Fig. 3, RaMP outperforms CQL and SF when the online objectives are out of distribution. A natural question is how things change when the setting satisfies the assumption that offline and online objectives are correlated. For example, in a 2D reaching environment, the training dataset may be annotated with rewards corresponding either to goals on the right half only or to goals covering the entire plane. When the online objective is to reach a goal on the left half of the plane, it is out of distribution in the first case and in distribution in the second. When we curate the labeling process of the privileged dataset to satisfy the in-distribution assumption, CQL and SF receive a significant performance boost. As shown in Fig. 7, the performance of our method and of MBPO is unaffected, as neither algorithm depends on offline objectives. CQL, on the other hand, matches the performance of our method under this new setting. This serves as a foil to the generalization of our method to out-of-distribution online objectives.

B.6 MPPI EXPERIMENTS

We provide additional results on model-predictive control via model-predictive path integral (MPPI) [54] in Fig. 9. MPPI is a sampling-based trajectory optimization method which maintains a distribution over action sequences, initialized as an isotropic standard Gaussian. In each iteration, MPPI draws n action sequences from the current distribution and computes their values, which we do using the learned Q-basis networks and online regression weights. MPPI then updates the distribution using the weighted mean and standard deviation of the sampled trajectories, where the weights are computed as the softmax of the values multiplied by a temperature parameter γ. As shown in Fig. 9, MPPI improves the performance of our method across two Metaworld environments and the D'Claw environment, indicating that our method can benefit from powerful planning algorithms. In these experiments, we perform 10 optimization steps and sample n = 1000 trajectories in each step. We use γ = 10 for Metaworld and γ = 50 for D'Claw.
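The MPPI update described above can be sketched in a few lines. This is a minimal illustration with scalar actions and a toy value function, not the experimental implementation: `value_fn` is a hypothetical stand-in for the weighted Q-basis prediction, and the small floor on the standard deviation is an added safeguard that is not mentioned in the text.

```python
import math
import random


def mppi(value_fn, horizon, num_samples=1000, num_iters=10, temperature=10.0, rng=None):
    """Iteratively refit a diagonal Gaussian over action sequences (scalar actions)."""
    rng = rng or random.Random(0)
    mean = [0.0] * horizon   # isotropic standard Gaussian initialization
    std = [1.0] * horizon
    for _ in range(num_iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(num_samples)]
        values = [value_fn(seq) for seq in samples]
        # Softmax of temperature-scaled values (max-subtracted for stability).
        v_max = max(values)
        exp_v = [math.exp(temperature * (v - v_max)) for v in values]
        total = sum(exp_v)
        weights = [e / total for e in exp_v]
        # Weighted mean and standard deviation per time step.
        mean = [sum(w * seq[t] for w, seq in zip(weights, samples))
                for t in range(horizon)]
        std = [max(1e-3, math.sqrt(sum(w * (seq[t] - mean[t]) ** 2
                                       for w, seq in zip(weights, samples))))
               for t in range(horizon)]
    return mean, std


def toy_value(seq):
    """Toy objective whose maximizer is the constant sequence 0.5 (illustrative only)."""
    return -sum((a - 0.5) ** 2 for a in seq)


plan_mean, plan_std = mppi(toy_value, horizon=3, rng=random.Random(0))
```

On this toy objective the mean of the sampling distribution moves toward the maximizer while the standard deviation shrinks, which is the intended behavior of the softmax-weighted update.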

B.7 FINETUNING EXPERIMENTS

While we freeze the Q-basis networks at test time in our main experiments to demonstrate transfer behavior, we can in fact finetune the Q-basis networks to continuously improve our estimate of the Q-value. After performing reward regression for a number of steps, we can finetune the Q-basis networks on online trajectories by fitting the predicted Q-values to the Monte-Carlo Q values and allowing the gradients to flow through the regression weights. We conduct finetuning experiments on two Metaworld environments and the D'Claw environment. As shown in Fig. 10, our method sees a noticeable performance increase with finetuning starting at 6400 steps.

C DETAILED THEORETICAL RESULTS

In this section, we provide the formal statement of the theoretical insights given in §3.3, and corresponding proofs.

C.1 FORMAL STATEMENT

Theorem C.1. Suppose the offline data in $D$ are generated from some distribution $\rho \in \Delta(S \times A)$, i.e., $(s_h^m, a_h^m) \sim \rho(\cdot, \cdot)$ and $s_{h+1}^m \sim T(\cdot \mid s_h^m, a_h^m)$ for all $(m, h) \in [M] \times [H]$, with $\min_{(s,a)} \rho(s, a) = \underline{\rho} > 0$. Suppose $\theta_k \sim p(\cdot)$ for all $k \in [K]$, $|\phi(s, a; \theta)| \le \kappa$ for some $\kappa > 0$, and $\phi(\cdot, \cdot; \theta)$ is continuous. For some large enough $n := MH$, letting $\lambda = n^{-1/2}$, we have that if $K = \Omega(\sqrt{n} \log(\kappa^2 \sqrt{n} / \delta))$, then with probability at least $1 - \delta$, for any given reward function $R$,

$$\| Q_\pi(w^*) - Q_\pi \|_\infty \le \frac{1}{1 - \gamma} \sqrt{ \frac{1}{\underline{\rho}} \bigg( \inf_{f \in \mathcal{H}} \underbrace{\sum_{(s,a) \in S \times A} \int \big( r - f(s, a) \big)^2 \, dR_{s,a}(r) \, \rho(s, a)}_{\mathcal{E}(f)} + O\Big( \frac{\log(1/\delta)}{\sqrt{n}} \Big) \bigg) }$$

for any policy $\pi$, where $\mathcal{H} := \{ f = \int \phi(\cdot, \cdot; \theta) w(\theta) \, dp(\theta) \mid \int |w(\theta)|^2 \, dp(\theta) < \infty \}$, $w^*$ is the solution to (3.2), and

$$Q_\pi(s, a; w^*) := \mathbb{E}\bigg[ \sum_{h=1}^{\infty} \gamma^{h-1} \sum_{k \in [K]} w_k^* \phi(s_h, a_h; \theta_k) \,\bigg|\, s_1 = s, a_1 = a \bigg]. \tag{C.1}$$

The proof of Theorem C.1 can be found in §C.2. It shows that with a large enough number of random features, the Q-function of any reward function $R$, under any policy $\pi$, can be approximated accurately up to some inherent error related to the richness of the function class that the features can represent. Note that we here only state the results under some mild and basic assumptions from the random features literature, and the resulting bounds are by no means tight. They can be improved in various ways, for example, if the sampling distribution $p$ of $\theta$ can be data-dependent, or under stronger assumptions on the data and kernel function classes [47, 50].

Corollary C.2. Suppose the assumptions in Theorem C.1 hold, and additionally the kernel-induced function space $\mathcal{H}$ is rich enough such that $\inf_{f \in \mathcal{H}} \mathcal{E}(f) = 0$. Then, with horizon length $H = \Theta\big( \frac{\log(R_{\max}/\epsilon)}{1 - \gamma} \big)$ and $M = \Theta\big( \frac{1}{(1 - \gamma)^3 \epsilon^4} \big)$ episodes in dataset $D$, and with $K = \Omega((1 - \gamma)^{-2} \epsilon^{-2})$ random features, we have $\| Q_\pi^H(w^*) - Q_\pi \|_\infty \le O(\epsilon)$ for any $\pi$, where for each $(s, a)$, $Q_\pi^H(s, a; w^*)$ is an $H$-horizon truncation of (C.1), which can be estimated from the offline dataset $D$.
The proof of Corollary C.2 can be found in §C.2. The corollary specifies the number of random features, the horizon length per episode, and the number of episodes needed to approximate $Q_\pi$ accurately using the data in $D$. Note that the number of random features is not excessive and is polynomial in the problem parameters. Combining Theorem C.1 and Corollary C.2 leads to the informal statement in Theorem 3.1.

Next, we justify the use of open-loop Q-functions in planning in a deterministic transition environment, which covers all the environments evaluated in our experiments. Recall that for any given reward $R$, $Q_\pi$ denotes the Q-function under policy $\pi$. Note that with a slight abuse of notation, $Q_\pi$ can also be the multi-step Q-function (see definition in §2), and the meaning should be clear from the input, i.e., whether it is $Q_\pi(s, a)$ or $Q_\pi(s, a_1, \cdots, a_H)$.

Theorem C.3. Let $\Pi$ be some subclass of Markov stationary policies, i.e., for any $\pi \in \Pi$, $\pi : S \to \Delta(A)$. Suppose the transition dynamics $T$ is deterministic. For any given reward $R$, denote the $H$-step policy obtained from $H$-step open-loop policy improvement over $\Pi$ as $\pi'_H : S \to A^H$, defined as

$$\pi'_H(s) \in \arg\max_{(a_1, \cdots, a_H) \in A^H} \max_{\pi \in \Pi} Q_\pi(s, a_1, \cdots, a_H)$$

for all $s \in S$. Finally, define the value function under $\pi'_H$ as $V_{\pi'_H}(s) := Q_{\pi'_H}(s, \pi'_H(s))$, where $Q_{\pi'_H}(s,$ […] $\max_{\pi \in \Pi} Q_\pi(s, a_1, \cdots, a_H) \ge \max_{a} \max_{\pi \in \Pi} Q_\pi(s, a)$.

C.2 DEFERRED PROOFS

C.2.1 PROOF OF THEOREM C.1

The proof relies on generalization guarantees for learning from random features with squared loss. Note that one cannot use the results in [46], which dealt with Lipschitz loss functions of the form $c(y', y) = c(y' y)$. This does not include the squared loss we used in our experiments. Instead, we resort to the results in [47], which also yield a better statistical rate.
For the sake of completeness, we re-state an abridged version of one key result therein, Theorem 1 in [47], as follows.

Lemma C.4. Suppose that $K$ is a kernel with an integral representation $K(x, x') = \int_\Omega \psi(x, w) \psi(x', w) \, dp(w)$, where $(\Omega, p)$ is a probability space and $\psi : X \times \Omega \to \mathbb{R}$, where $X$ is a separable space. Suppose $\psi$ is continuous and $|\psi(x, w)| \le \kappa$ with $\kappa \in [1, +\infty)$ almost surely, and $|y| \le b$ almost surely. Define the expected risk $\mathcal{E}(f) := \int (f(x) - y)^2 \, d\rho(x, y)$, where $\rho$ is the distribution from which the data samples $(x_i, y_i)_{i=1}^n$ are drawn. Define the solution to kernel ridge regression with $M$ random features as $f_{\lambda, M}(x) = \phi_M(x)^\top w_{\lambda, M}$, with

$$w_{\lambda, M} := ( S_M^\top S_M + \lambda I )^{-1} S_M^\top y, \tag{C.2}$$

where $\phi_M(x) := \big( \psi(x, w_1), \psi(x, w_2), \cdots, \psi(x, w_M) \big) / \sqrt{M}$, the $w_i$ are drawn i.i.d. from $p(\cdot)$, $y := (y_1, \cdots, y_n) / n^{1/2}$, and $S_M^\top := \big( \phi_M(x_1), \cdots, \phi_M(x_n) \big) / n^{1/2}$. Then, suppose $n \ge n_0$, $\lambda = 1 / n^{1/2}$, and the number of features $M \ge c_0 \sqrt{n} \log(108 \kappa^2 \sqrt{n} / \delta)$. We have that with probability at least $1 - \delta$,

$$\mathcal{E}(f_{\lambda, M}) - \min_{f \in \mathcal{H}} \mathcal{E}(f) \le \frac{c_1 \log^2(18/\delta)}{\sqrt{n}},$$

where $n_0, c_0, c_1$ are absolute constants and $\mathcal{H}$ is the reproducing kernel Hilbert space corresponding to the kernel $K$.

We then apply Lemma C.4, with $(x, y)$ in the lemma being replaced by $\big( (s, a), r(s, a) \big)$, and $\rho(x, y), p, x, w, M, \lambda$ in the lemma being replaced by $\rho(s, a) \cdot R_{s,a}, p, (s, a), \theta, K, \lambda$ in our case. Note that Lemma C.4 requires the space $X$ to be separable, which our finite space $S \times A$ satisfies; it also requires $|y|$ to be bounded, which holds since our reward is bounded in absolute value by $R_{\max}$. We hence obtain that with probability at least $1 - \delta$, if the number of random features satisfies $K \ge \Omega(\sqrt{n} \log(\sqrt{n}))$, with $n := HM$, then

$$\mathbb{E}_{(s,a) \sim \rho(\cdot, \cdot), \, r \sim R_{s,a}(\cdot)} \bigg( r - \sum_{k \in [K]} w_k^* \phi(s, a; \theta_k) \bigg)^2 \le \inf_{f \in \mathcal{H}} \mathcal{E}(f) + O\Big( \frac{\log(1/\delta)}{\sqrt{n}} \Big), \tag{C.3}$$

where we note that $w^* = (w_1^*, \cdots, w_K^*)$ is the solution to (3.2), and the $\mathcal{E}(f)$ here is defined in Theorem C.1.
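For any policy $\pi$ for the MDP, let $Q_\pi$ denote the Q-function under policy $\pi$ and the actual reward function distribution $R$, and let $Q_\pi(w^*)$ denote the Q-function under the estimated reward using random features:

$$Q_\pi(s, a; w^*) := \mathbb{E}\bigg[ \sum_{h=1}^{\infty} \gamma^{h-1} r(s_h, a_h; w^*) \,\bigg|\, s_1 = s, a_1 = a \bigg],$$

where, consistent with (C.1), $r(s, a; w^*) := \sum_{k \in [K]} w_k^* \phi(s, a; \theta_k)$. By the Bellman equation, we have that for each $(s, a)$,

$$\begin{aligned}
\big| Q_\pi(s, a) - Q_\pi(s, a; w^*) \big| &= \bigg| \int r \, dR_{s,a}(r) + \gamma \sum_{s', a'} Q_\pi(s', a') T(s' \mid s, a) \pi(a' \mid s') - r(s, a; w^*) - \gamma \sum_{s', a'} Q_\pi(s', a'; w^*) T(s' \mid s, a) \pi(a' \mid s') \bigg| \\
&\le \bigg| \int r \, dR_{s,a}(r) - r(s, a; w^*) \bigg| + \gamma \cdot \big\| Q_\pi - Q_\pi(w^*) \big\|_\infty .
\end{aligned}$$

Taking the sup over $(s, a)$ and organizing the terms, we have

$$\begin{aligned}
\big\| Q_\pi - Q_\pi(w^*) \big\|_\infty &\le \frac{1}{1 - \gamma} \cdot \sup_{s, a} \bigg| \int r \, dR_{s,a}(r) - r(s, a; w^*) \bigg| \\
&\le \frac{1}{1 - \gamma} \cdot \sqrt{ \sum_{s, a} \bigg| \int r \, dR_{s,a}(r) - r(s, a; w^*) \bigg|^2 } && \text{(C.5)} \\
&\le \frac{1}{1 - \gamma} \cdot \sqrt{ \frac{1}{\underline{\rho}} \sum_{s, a} \bigg| \int r \, dR_{s,a}(r) - r(s, a; w^*) \bigg|^2 \rho(s, a) } && \text{(C.6)} \\
&= \frac{1}{(1 - \gamma) \sqrt{\underline{\rho}}} \cdot \sqrt{ \mathbb{E}_{(s,a) \sim \rho} \bigg| \int r \, dR_{s,a}(r) - r(s, a; w^*) \bigg|^2 } .
\end{aligned}$$

[…] Combining this with Jensen's inequality and (C.3), we obtain

$$\begin{aligned}
\big\| Q_\pi - Q_\pi(w^*) \big\|_\infty &\le \frac{1}{(1 - \gamma) \sqrt{\underline{\rho}}} \cdot \sqrt{ \mathbb{E}_{(s,a) \sim \rho(s,a), \, r \sim R_{s,a}(\cdot)} \bigg| \int r \, dR_{s,a}(r) - r(s, a; w^*) \bigg|^2 } \\
&\le \frac{1}{(1 - \gamma) \sqrt{\underline{\rho}}} \cdot \sqrt{ \inf_{f \in \mathcal{H}} \mathcal{E}(f) + O\Big( \frac{\log(1/\delta)}{\sqrt{n}} \Big) },
\end{aligned}$$

which completes the proof.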

C.2.2 PROOF OF COROLLARY C.2

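First, note that choosing $H = \Theta\big( \frac{\log(R_{\max} / (\epsilon (1 - \gamma)))}{1 - \gamma} \big)$ ensures that $\| Q_\pi^H(w^*) - Q_\pi(w^*) \|_\infty \le O(\epsilon)$, which follows from the boundedness of $r(s, a)$ by $R_{\max}$ and the fact that

$$\frac{\gamma^H R_{\max}}{1 - \gamma} = \big( 1 - (1 - \gamma) \big)^{\frac{1}{1 - \gamma} \cdot H (1 - \gamma)} \frac{R_{\max}}{1 - \gamma} \le \Big( \frac{1}{e} \Big)^{\log(R_{\max} / (\epsilon (1 - \gamma)))} \frac{R_{\max}}{1 - \gamma} = \epsilon.$$

Furthermore, Theorem C.1 requires $n = HM = \Theta\big( \frac{1}{(1 - \gamma)^4 \epsilon^4} \big)$ to ensure that $\| Q_\pi(w^*) - Q_\pi \|_\infty \le \epsilon$. Combining these facts yields the desired result.

C.2.3 PROOF OF THEOREM 3.2

Define $Q_H^{\max}(s, a_1, \cdots, a_H) := \max_{\pi \in \Pi} Q_\pi(s, a_1, \cdots, a_H)$, and $Q^{\max}(s, a) := \max_{\pi \in \Pi} Q_\pi(s, a)$. We also define the Bellman operator under the open-loop policy $\pi'_H$ as follows: for any $Q \in \mathbb{R}^{|S| \times |A^H|}$,

$$T_{H, \pi'_H}(Q)(s, a_1, \cdots, a_H) = \mathbb{E}\bigg[ \sum_{h \in [H]} \gamma^{h-1} r(s_h, a_h) + \gamma^H Q(s_{H+1}, \pi'_H(s_{H+1})) \,\bigg|\, s_1 = s, \ \text{[…]}$$

[…] in $Q_{H, \pi'_H}(s, a_{1:H})$. Note that

$$\begin{aligned}
T_{H, \pi'_H}(Q_H^{\max})(s, a_1, \cdots, a_H) &= \mathbb{E}\bigg[ \sum_{h \in [H]} \gamma^{h-1} r(s_h, a_h) + \gamma^H Q_H^{\max}(s_{H+1}, \pi'_H(s_{H+1})) \,\bigg|\, s_1 = s, a_{1:H} \bigg] \\
&= \mathbb{E}\bigg[ \sum_{h \in [H]} \gamma^{h-1} r(s_h, a_h) + \gamma^H \max_{a_{H+1:2H}} Q_H^{\max}(s_{H+1}, a_{H+1:2H}) \,\bigg|\, s_1 = s, a_{1:H} \bigg] && \text{(C.9)} \\
&\ge \mathbb{E}\bigg[ \sum_{h \in [H]} \gamma^{h-1} r(s_h, a_h) + \gamma^H \max_{a_{H+1:2H}} Q_\pi(s_{H+1}, a_{H+1:2H}) \,\bigg|\, s_1 = s, a_{1:H} \bigg] && \text{(C.10)} \\
&\ge \mathbb{E}\bigg[ \sum_{h \in [H]} \gamma^{h-1} r(s_h, a_h) + \gamma^H Q_\pi(s_{H+1}, \pi(s_{H+1}), \cdots, \pi(s_{2H})) \,\bigg|\, s_1 = s, a_{1:H} \bigg] && \text{(C.11)} \\
&= Q_\pi(s, a_1, \cdots, a_H),
\end{aligned}$$

[…]

$$Q_{H, \pi'_H}(s, a_{1:H}) = \lim_{k \to \infty} (T_{H, \pi'_H})^k (Q_H^{\max})(s, a_{1:H}) \ge Q_H^{\max}(s, a_{1:H}) \ge \max_{\pi \in \Pi} Q_\pi(s, a_{1:H}). \tag{C.13}$$

Notice that for all $s$, by applying $\pi'_H(s)$ on both sides of (C.13),

$$V_{H, \pi'_H}(s) = Q_{H, \pi'_H}(s, \pi'_H(s)) \ge \max_{\pi \in \Pi} Q_\pi(s, \pi'_H(s)) = \max_{a_{1:H}} \max_{\pi \in \Pi} Q_\pi(s, a_{1:H}). \tag{C.14}$$

Further, due to the multi-step maximization, we have $\max_{a_{1:H}} \max_{\pi \in \Pi} Q_\pi(s, a_1, \cdots, a_H) \ge \max_{a} \max_{\pi \in \Pi} Q_\pi(s, a)$, which, combined with (C.14), completes the proof.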
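As a numerical sanity check of the truncation step in the proof of Corollary C.2, the following sketch computes the smallest integer horizon implied by $H \ge \log(R_{\max} / (\epsilon (1 - \gamma))) / (1 - \gamma)$ and evaluates the tail bound $\gamma^H R_{\max} / (1 - \gamma)$. The function names are hypothetical, and the constant hidden in the $\Theta(\cdot)$ is taken to be 1 for illustration.

```python
import math


def truncation_horizon(r_max, eps, gamma):
    """Smallest integer H with H >= log(r_max / (eps * (1 - gamma))) / (1 - gamma).

    Assumes eps * (1 - gamma) < r_max so that the logarithm is positive.
    """
    return math.ceil(math.log(r_max / (eps * (1.0 - gamma))) / (1.0 - gamma))


def tail_bound(r_max, gamma, horizon):
    """Upper bound on the discounted reward mass beyond step `horizon`."""
    return gamma ** horizon * r_max / (1.0 - gamma)


for gamma in (0.9, 0.99):
    H = truncation_horizon(r_max=1.0, eps=0.01, gamma=gamma)
    print(gamma, H, tail_bound(1.0, gamma, H))
```

The check relies on the same inequality as the proof, namely $\gamma^{1/(1-\gamma)} \le 1/e$, so the printed tail bound stays below the chosen $\epsilon$ for every discount factor in $(0, 1)$.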

D RELATIONSHIP TO EXISTING WORK

We briefly connect our proposed algorithm to prior work.

Successor features for transfer in RL: While successor features [40, 44] have shown the ability to transfer across problems in RL, the two key differences in our framework are (1) using random features rather than requiring learned or pre-provided features and (2) training open-loop Q functions rather than typical Q π . These two changes allow transfer to happen across a broader class of reward functions and not simply be restricted to the policy cover experienced at training time.

Model-based RL: Our work is connected to model-based RL in that it disentangles dynamics and rewards, but is crucially different in that it doesn't model one-step evolution of state but rather long-term accumulation of random features. This trades off compounding error for generalization error.

Model-free RL: Our work is connected to methods for model-free RL in that it also models a set of Q-functions, but importantly this doesn't correspond to a particular reward, but rather to random features of state. By doing so, we are able to adapt to arbitrary rewards at test time, rather than being tied to a particular reward function.

E INFINITE-HORIZON Q-FUNCTION VARIANT

While our setup in Section 2 uses a finite-horizon Q H π to approximate Q π , our method can also plan with an infinite-horizon Q function during the online phase. In this section, we describe one compatible way to learn an infinite-horizon Q-function while still enjoying the benefits of RaMP in the case with deterministic transition dynamics. We also present empirical results and analysis of this variant.

E.1 METHOD

We first notice that an infinite-horizon Q-function $Q_\pi$ can be decomposed into the discounted sum of an $H$-step reward and a discounted value function under policy $\pi$ evaluated at $s_{H+1}$:

$$Q_\pi(s', a') = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t), \, s_{t+1} \sim T(\cdot \mid s_t, a_t)} \bigg[ \gamma^H V_\pi(s_{H+1}) + \sum_{t=1}^{H} \gamma^{t-1} r(s_t, a_t) \,\bigg|\, s_1 = s', a_1 = a' \bigg],$$

where $V_\pi(s') = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)} \big[ \sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t) \mid s_1 = s' \big]$. Given a policy $\pi$, a value function $V_\pi^\theta$ parameterized by $\theta$ can be learned via gradient descent on sampled transitions with a bootstrapped target:

$$\theta \leftarrow \theta - \alpha \nabla_\theta \big\| V_\pi^\theta(s_t) - \big( r_t + \gamma V_\pi^{\theta'}(s_{t+1}) \big) \big\|_2^2 ,$$

for sampled $(s_{t:t+1}, a_t, r_t) \sim \tau_\pi$, where $\tau_\pi$ denotes trajectory rollouts collected with the current policy $\pi$ and $\theta'$ denotes the parameters of a target network that is updated from $\theta$ with momentum.

Now consider our multi-step setup. Our multi-step Q-function can also be written as the sum of our $H$-step approximation and the discounted value function at $s_{H+1}$:

$$Q_\pi(s, a_{1:H}) = \mathbb{E}_{a_{H+t} \sim \pi(\cdot \mid s_{H+t}), \, s_{t+1} \sim T(\cdot \mid s_t, a_t)} \bigg[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\bigg|\, s_1 = s, a_1 = a_1, \cdots, a_H = a_H \bigg] = Q_\pi^H(s, a_{1:H}) + \gamma^H V_\pi(s_{H+1}),$$

where we note that in the last expression there is no expectation over $s_{H+1}$, since the transition dynamics is deterministic and $s_{H+1}$ is deterministically decided by $(s_1, a_{1:H})$. Vanilla RaMP enables efficient estimation of $Q_\pi$ with a novel reward function at the cost of truncating the second term above, i.e., $Q_\pi \approx Q_\pi^H$. As we have shown in Section 4, planning with this finite-horizon Q-approximation already leads to reasonable planning in most of the experiments. We can go a step further and also estimate the second term, so that we can plan in an infinite-horizon setting.

The main challenge is obtaining $V_\pi(s_{H+1})$ in our multi-step setup, as we do not explicitly predict $s_{H+1}$. This, however, can be addressed easily by reparameterizing $V_\pi(s_{H+1})$ on an action sequence that leads to $s_{H+1}$, just as we did for $Q$. We thus define a multi-step value function

$$F_\pi(s, a_{1:H}) = \mathbb{E}_{s_{H+1}} \big[ V_\pi(s_{H+1}) \mid s_1 = s, a_1 = a_1, \cdots, a_H = a_H \big].$$

Then $Q_\pi(s, a_{1:H}) = Q_\pi^H(s, a_{1:H}) + \gamma^H \cdot F_\pi(s, a_{1:H})$. Under our deterministic transition dynamics, $s_{H+1}$ is fully determined by $(s_1, a_{1:H})$, so we can remove the expectation in the equation. We then rewrite the training objective of $V_\pi$ in terms of $F_\pi$ to learn $F_\pi$:

$$\theta \leftarrow \theta - \ \text{[…]}$$

For planning, we do on-policy learning by alternating between policy rollouts and $Q_\pi$ learning. As a policy, the MPC planner first uses the infinite-horizon $Q_\pi(s, a_{1:H}) = Q_\pi^H(s, a_{1:H}) + \gamma^H \cdot F_\pi(s, a_{1:H})$ to plan and collect rollouts. Then $F_\pi$ is trained with these on-policy rollouts, while $Q_\pi^H$ is also learned as in vanilla RaMP via online least squares. By incorporating this infinite-horizon variant, our MPC planner can now plan with an infinite horizon during the online phase.
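The decomposition $Q_\pi(s, a_{1:H}) = Q_\pi^H(s, a_{1:H}) + \gamma^H V_\pi(s_{H+1})$ under deterministic dynamics can be checked numerically on a toy problem. The chain environment, reward, and policy below are hypothetical and chosen only for illustration, and infinite sums are approximated by long rollouts.

```python
def step(state, action):
    """Deterministic dynamics on the integer chain 0..10 (toy environment)."""
    return min(10, max(0, state + action))


def reward(state, action):
    """Toy reward: negative distance to position 7."""
    return -abs(state - 7)


def policy(state):
    """Toy deterministic policy that moves toward position 7."""
    return 1 if state < 7 else -1


def rollout_return(state, actions, gamma, tail_steps):
    """Discounted return of executing `actions` open-loop, then following `policy`."""
    total, discount = 0.0, 1.0
    for a in actions:
        total += discount * reward(state, a)
        state = step(state, a)
        discount *= gamma
    for _ in range(tail_steps):
        a = policy(state)
        total += discount * reward(state, a)
        state = step(state, a)
        discount *= gamma
    return total, state


gamma, s1, actions = 0.9, 2, [1, -1, 1, 1]      # H = 4 open-loop actions
H = len(actions)
q_full, _ = rollout_return(s1, actions, gamma, tail_steps=500)   # approximates Q_pi(s, a_{1:H})
q_trunc, s_next = rollout_return(s1, actions, gamma, tail_steps=0)  # Q^H_pi(s, a_{1:H})
f_value, _ = rollout_return(s_next, [], gamma, tail_steps=500)   # F_pi(s, a_{1:H}) = V_pi(s_{H+1})
gap = abs(q_full - (q_trunc + gamma ** H * f_value))
```

Because the dynamics are deterministic, the state reached after the $H$ open-loop actions is unique, so `f_value` plays the role of $F_\pi(s, a_{1:H})$ without any expectation, and `gap` is zero up to floating-point error.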

E.2 EXPERIMENT AND ANALYSIS

We implemented the infinite-horizon variant described above and carried out experiments to quantitatively evaluate its effect on performance in four environments with varying types of tasks. As shown in Figure 11, the infinite-horizon variant outperforms vanilla RaMP with the finite-horizon Q-function in Hopper Jump and achieves comparable performance in D'Claw, these being the longer-horizon environments among the four. However, the infinite-horizon variant performs worse in the two Metaworld manipulation environments, ReachWall and FaucetOpen, likely because these tasks do not require infinite-horizon reasoning. In particular, the multi-step value function V has to be learned from scratch from samples, which may even hurt performance at the beginning of the online phase.

We also present results of this variant with the infinite-horizon Q-function on all 8 environments, compared with all baselines. We run all the environments for 25600 steps, twice the number of steps used in Section 4. We also changed the discount factor $\gamma$ to 0.9 to emphasize the effectiveness of RaMP's quick adaptation to new rewards using $Q^\pi_H$. As shown in Figure 12, RaMP is able to adapt quickly to a new reward function when the amount of data is extremely low. As the algorithm sees more data, the advantage of RaMP becomes less salient, since our infinite-horizon value function learning uses vanilla bootstrapping. However, because the infinite-horizon variant uses bootstrapping, the method can continue to improve, like other RL methods, as the number of samples grows. This will be useful for longer-horizon and harder tasks.
We note that although we used the simplest bootstrapping approach to learn the multi-step V, any other value estimation method can be used to make it more efficient. The core of this variant is to parameterize the value function at step $H+1$ by the initial state and a sequence of actions.

F COMPLEXITY ANALYSIS

The space complexity of RaMP is primarily determined by the number of random features that are needed. As we describe in Corollary C.2, we require $K = \Omega((1-\gamma)^{-2}\epsilon^{-2})$ random features to achieve $\epsilon$-order error between the estimated and true Q-function. As the required approximation error decreases, the space needed to store all the features therefore grows polynomially, on the order of $\epsilon^{-2}$. Note that this result holds for any given reward function (including the target one) under any policy $\pi$, and is not tied to a specific one (under the realizability assumption of the reward function). On the other hand, the space complexity of the classical successor feature method is relatively fixed in the above sense: the dimension of the feature is fixed and does not change with the accuracy of approximating the optimal Q-function for the target task. However, the resulting guarantee is also more restricted: it holds only for the optimal Q-function of the specific target reward, and it also depends on the distance between the previously seen and the target reward functions. Hence, the two space complexities are not necessarily comparable.
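To make the scaling in the bound $K = \Omega((1-\gamma)^{-2}\epsilon^{-2})$ concrete, the short sketch below tabulates how the required number of features changes with $\epsilon$ and $\gamma$. The function name and the `constant` argument are hypothetical: the $\Omega$ notation hides a constant that the bound does not specify, so the absolute numbers are only indicative and only the ratios are meaningful.

```python
import math


def required_features(epsilon, gamma, constant=1.0):
    """Scale of K implied by K = Omega((1 - gamma)^-2 * epsilon^-2).

    `constant` is a hypothetical stand-in for the constant hidden by Omega.
    """
    return math.ceil(constant / ((1.0 - gamma) ** 2 * epsilon ** 2))


k_coarse = required_features(epsilon=0.1, gamma=0.9)
k_fine = required_features(epsilon=0.05, gamma=0.9)   # half the target error
k_long = required_features(epsilon=0.1, gamma=0.95)   # half the value of (1 - gamma)

for name, k in [("coarse", k_coarse), ("fine", k_fine), ("long", k_long)]:
    print(f"{name}: K is on the order of {k}")
```

Halving the target error, or halving $1-\gamma$, multiplies the feature count by roughly four in this sketch.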



Figure 1: RaMP: Depiction of our proposed method for transferring behavior across tasks by leveraging modelfree learning of random features. At training time, Q-basis functions are trained on accumulated random features. At test time, adaptation is performed by solving linear regression and recombining basis functions, followed by online planning with MPC.

Figure 2: We evaluate our method on manipulation, locomotion, and high-dimensional action space environments. The green arrow in each environment indicates the online objective for policy transfer, while the red arrows are offline objectives used to label rewards for the privileged dataset.

value function under the open-loop policy $\pi'_H$ (see formal definition in §C.1). Then, we have that for all $s \in \mathcal{S}$

Figure 3: Reward transfer results on Metaworld, Hopper and D'Claw environments. RaMP adapts to novel rewards more rapidly than MBPO, Successor Features, and CQL baselines. More experiments are in the appendix.

Model-free Transfer with Randomized Cumulants and Model-predictive Control
1: Input: Offline dataset $\mathcal{D}$ given by (2.1), distribution $p$ over $\mathbb{R}^d$, number of random features $K$
Offline Training Phase:
3: Randomly sample $\{\theta_k\}_{k\in[K]}$ with $\theta_k \sim p$, and construct dataset
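The offline and adaptation steps named in this listing can be sketched on a toy problem as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-dimensional dynamics, the cosine form of the random features, the Gaussian sampling distribution, and all function names are hypothetical, the new reward is constructed to be exactly realizable by the features, and the learned Q-basis networks are replaced by exact rollouts to keep the example short.

```python
# Random-feature sketch: sample theta_k, accumulate cumulants, fit reward
# weights by least squares, and recombine. All names are hypothetical.
import math
import random

rng = random.Random(0)
GAMMA, K = 0.9, 4

# Offline phase: sample theta_k ~ p (a Gaussian/uniform mix, as an assumption).
thetas = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.uniform(0, 2 * math.pi))
          for _ in range(K)]


def phi(s, a, theta):
    """Random feature phi(s, a; theta_k): a random cosine of (s, a)."""
    ws, wa, b = theta
    return math.cos(ws * s + wa * a + b)


def step(s, a):
    """Deterministic toy dynamics standing in for the real environment."""
    return 0.8 * s + a


def cumulants(s, actions):
    """Discounted sum of each random feature along an open-loop action sequence.
    In RaMP these quantities are predicted by Q-basis networks; here they are
    computed exactly by rolling out the toy dynamics."""
    sums, discount = [0.0] * K, 1.0
    for a in actions:
        for k in range(K):
            sums[k] += discount * phi(s, a, thetas[k])
        discount *= GAMMA
        s = step(s, a)
    return sums


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


# Test-time reward, realizable as a linear combination of the features.
w_true = [0.5, -1.0, 2.0, 0.3]


def new_reward(s, a):
    return sum(w * phi(s, a, th) for w, th in zip(w_true, thetas))


# Adaptation: least-squares fit of w from reward-labelled samples.
samples = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(50)]
feats = [[phi(s, a, th) for th in thetas] for s, a in samples]
labels = [new_reward(s, a) for s, a in samples]
gram = [[sum(f[i] * f[j] for f in feats) for j in range(K)] for i in range(K)]
rhs = [sum(f[i] * y for f, y in zip(feats, labels)) for i in range(K)]
w_hat = solve(gram, rhs)

# Recombination: the H-step Q-value for the new reward is a weighted sum of cumulants.
s0, plan = 0.2, (0.5, -0.3, 0.1)
q_pred = sum(w * c for w, c in zip(w_hat, cumulants(s0, plan)))

# Ground truth for comparison: discounted sum of the new reward along the plan.
q_true, discount, s = 0.0, 1.0, s0
for a in plan:
    q_true += discount * new_reward(s, a)
    discount *= GAMMA
    s = step(s, a)
print(f"predicted Q_H = {q_pred:.6f}, true Q_H = {q_true:.6f}")
```

The recombination step works because the discounted sum is linear in the reward, so weights fitted on single-step rewards transfer directly to the accumulated features.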

Figure 4: Results on Hopper Stand

Figure 5: Results on Metaworld with high-dimensional pixel observation. RaMP achieves comparable performance to Dreamer on Pixel Reach and Pixel Faucet Open but struggles to perform well on Pixel Door Open. The performance of RaMP on Pixel Faucet Open and Pixel Reach is very similar to its performance in state observation environments.

Figure 6: Learning curves for different random feature dimensions. Low-dimensional random features suffer from poor convergence performance, whereas high-dimensional random features experience slow adaptation.

Figure 7: In-distribution results for Successor Features and CQL. Note that RaMP and MBPO are unaffected since they do not depend on the distribution of the offline objectives.

Figure 8: Visualization of true Q value and approximated Q value. Our method is able to approximate the Q value in the face of out-of-distribution and highly nonlinear rewards.

Figure 9: MPPI results on MetaWorld and D'Claw. MPPI improves the performance of our method across all four tasks, showing that our method can benefit from powerful planners.

Figure 10: Finetuning results on MetaWorld and D'Claw. Our method is able to continuously improve by finetuning the Q-basis networks during online training.

where $r(s, a; w^*) := \sum_{k\in[K]} w^*_k \phi(s, a; \theta_k)$. (C.4)

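$\cdots_{(s,a)\sim\rho(s,a)} \left| \int r\, dR_{s,a}(r) - r(s, a; w^*) \right|$, where (C.5) uses that $\|\cdot\|_\infty \le \|\cdot\|_2$ for finite-dimensional vectors, (C.6) uses the definition of $\rho$. Further, by Jensen's inequality, for each $(s, a)$,

$$\left| \int r\, dR_{s,a}(r) - r(s, a; w^*) \right| \le \mathbb{E}_{r\sim R_{s,a}(\cdot)} \left| r - r(s, a; w^*) \right| \le \sqrt{\mathbb{E}_{r\sim R_{s,a}(\cdot)} \left| r - r(s, a; w^*) \right|^2},$$

which, combined with (C.7) and (C.3), gives that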

Figure 11: Results of RaMP variant with infinite-horizon Q-function on four different environments. We notice this variant leads to better performance on longer-horizon tasks but doesn't increase the performance of Metaworld tasks, likely because the benchmarked tasks don't require infinite-horizon reasoning. Since the V-function has to be learned from scratch, it could also hurt the performance at the beginning in our low-data setting.

Figure 12: We benchmark RaMP with infinite-horizon Q-function on all environments for more steps. RaMP can quickly adapt to new test rewards in the low-data regime and can continuously improve with the addition of bootstrapping. However, it will need to rely on more sample-efficient value function learning methods to be more sample-efficient in the infinite-horizon setting. This shows that RaMP's benefits largely lie in the low-data regime.

$a^m_h; \theta_k)$ as the accumulated cumulants for open-loop action sequences $\{a_1, \cdots, a_H\}$ taken from state $s_1$. We then use $K$ function approximators representing each of the $K$ Q-basis functions, e.g., neural networks $\psi(\cdot, \cdot; \nu_k$

5. Policy evaluation error. Feedforward dynamics model suffers from compounding error that is particularly noticeable in domains with high action dimensions or chaotic dynamics. Our method achieves low approximation error in both domains.

Return for different features. Random features are able to approximate the true reward well across domains. Polynomial features work in simple environments but do not scale to complex rewards. Gaussian features are unable to express the reward.

Return as a function of random feature dimension. Low dimensional random features are unable to approximate the true reward with linear regression, leading to degraded convergence performance.

Approximation error with different state dimensions. As state dimension increases, approximation error increases but remains in a reasonable range.

$a_{1:H})$ is the fixed point of the Bellman operator $\mathcal{T}_{H,\pi'_H}$ defined in (C.8). Then, we have that for all $s \in \mathcal{S}$

$a_{1:H}$. Note that $\mathcal{T}_{H,\pi'_H}$ is a contracting operator, and we denote the fixed point of the operator as $Q_{H,\pi'_H} \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}^H|}$, which is the Q-value function under the open-loop policy $\pi'_H$. By definition, we also know that the state-value function under $\pi'$

(C.12) for any $\pi \in \Pi$, where (C.9) uses the definition of $\pi'_H$, (C.10) is due to the definition of $Q^{\max}_H$, (C.11) is by the $\max_{a_{H+1:2H}}$, and (C.12) is by definition. Since (C.12) holds for any $\pi \in \Pi$, by the monotonicity of $\mathcal{T}_{H,\pi'_H}$, we have $Q_{H,\pi'_H}(s, a_{1:H}$

$$\theta \leftarrow \theta - \alpha \nabla_\theta \left\| V^\pi_\theta(s_{t+H}) - \left( r_{t+H} + \gamma V^\pi_{\theta'}(s_{t+H+1}) \right) \right\|_2^2 \quad \text{for sampled } (s_{t+H:t+H+1}, a_{t+H}, r_{t+H}) \sim \tau_\pi$$

becomes

$$\theta \leftarrow \theta - \alpha \nabla_\theta \left\| F^\pi_\theta(s_t, a_{t:t+H-1}) - \left( r_{t+H} + \gamma F^\pi_{\theta'}(s_{t+1}, a_{t+1:t+H}) \right) \right\|_2^2 \quad \text{for sampled } (s_{t:t+1}, a_{t:t+H}, r_{t+H}) \sim \tau_\pi,$$

where we parameterize the multi-step value function $F^\pi$ by $F^\pi_\theta$. So we can learn $F^\pi$ with Monte-Carlo sampling and gradient descent, just as we do to learn $V^\pi$ in the single-step case above. Combined with $Q^\pi_H$, we now have an estimate of the infinite-horizon Q-function $Q^\pi$ in a multi-step manner.
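The multi-step update above can be illustrated with a tabular sketch, in which the parameters $\theta$ are simply one table entry per (state, action sequence) pair. The ring environment, reward, deterministic policy, and the exhaustive sweep over start states and action sequences are all simplifying assumptions made for illustration. In particular, the sweep replaces on-policy rollouts, no separate target network is kept, and the factor of 2 from the squared-error gradient is absorbed into the step size.

```python
# Tabular sketch of the multi-step value update
#   F(s_t, a_{t:t+H-1}) += lr * (r_{t+H} + gamma * F(s_{t+1}, a_{t+1:t+H}) - F(...))
# on a hypothetical deterministic ring MDP with a fixed deterministic policy.
from itertools import product

GAMMA, H, N_STATES, LR = 0.9, 2, 4, 0.5
ACTIONS = (0, 1)


def step(s, a):
    """Deterministic transition: action 1 moves right on the ring, 0 stays."""
    return (s + a) % N_STATES


def reward(s, a):
    """Reward 1 for acting in state 0, minus a small cost for moving."""
    return (1.0 if s == 0 else 0.0) - 0.1 * a


def policy(s):
    """Fixed deterministic policy pi: always move right."""
    return 1


def end_state(s, actions):
    """State s_{t+H} reached by applying the open-loop actions from s_t."""
    for a in actions:
        s = step(s, a)
    return s


keys = [(s, acts) for s in range(N_STATES) for acts in product(ACTIONS, repeat=H)]
F = {key: 0.0 for key in keys}

for _ in range(400):
    for s, acts in keys:
        s_h = end_state(s, acts)   # s_{t+H}
        a_h = policy(s_h)          # a_{t+H} chosen by pi
        r_h = reward(s_h, a_h)     # r_{t+H}
        # Shift the window by one step: (s_{t+1}, a_{t+1:t+H}).
        next_key = (step(s, acts[0]), acts[1:] + (a_h,))
        target = r_h + GAMMA * F[next_key]
        F[(s, acts)] += LR * (target - F[(s, acts)])

# Reference: V^pi from repeated Bellman backups under the same policy.
V = [0.0] * N_STATES
for _ in range(2000):
    V = [reward(s, policy(s)) + GAMMA * V[step(s, policy(s))]
         for s in range(N_STATES)]

max_err = max(abs(F[(s, acts)] - V[end_state(s, acts)]) for s, acts in keys)
print(f"max |F(s, a_1:H) - V(s_H+1)| = {max_err:.2e}")
```

In this toy setting the learned table agrees with $V^\pi(s_{H+1})$ for every start state and action sequence, without $s_{H+1}$ ever being stored as an input to $F$.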

