DOES ZERO-SHOT REINFORCEMENT LEARNING EXIST?

Abstract

A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards "controllable" agents that can follow arbitrary instructions in an environment. Current RL agents can at best solve families of related tasks, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) (Borsa et al., 2018) or forward-backward (FB) representations (Touati & Ollivier, 2021), but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and systematically test the viability of zero-shot RL schemes on tasks from the Unsupervised RL benchmark (Laskin et al., 2021). To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on autoencoders, inverse curiosity, transition models, low-rank transition matrices, contrastive learning, or diversity (APS) perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and most consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.

1. INTRODUCTION

For breadth of applications, reinforcement learning (RL) lags behind other fields of machine learning, such as vision or natural language processing, which have effectively adapted to a wide range of tasks, often in an almost zero-shot manner, using pretraining on large, unlabelled datasets (Brown et al., 2020). The RL paradigm itself may be partly to blame: RL agents are usually trained for only one reward function or a small family of related rewards. Instead, we would like to train "controllable" agents that can be given a description of any task (reward function) in their environment, and then immediately know what to do, reacting instantly to such commands as "fetch this object while avoiding that area". The promise of zero-shot RL is to train without rewards or tasks, yet immediately perform well on any reward function given at test time, with no extra training, planning, or finetuning, and only a minimal amount of extra computation to process a task description (Section 2 gives the precise definition we use for zero-shot RL). How far away are such zero-shot agents? In the standard RL paradigm, a new task (reward function) means re-training the agent from scratch and providing many reward samples. Model-based RL trains a reward-free, task-independent world model, but still requires heavy planning when a new reward function is specified (e.g., Chua et al., 2018; Moerland et al., 2020). Model-free RL is reward-centric from the start, and produces specialized agents. Multi-task agents generalize only within a family of related tasks. Reward-free, unsupervised skill pre-training (e.g., Eysenbach et al., 2018) still requires substantial downstream task adaptation, such as training a hierarchical controller. Is zero-shot RL possible at all? If one ignores practicality, zero-shot RL is easy: make a list of all possible rewards up to precision 𝜀, then pre-learn all the associated optimal policies.
Scalable zero-shot RL must somehow exploit the relationships between policies for all tasks. Learning to go from 𝑎 to 𝑐 is not independent from going from 𝑎 to 𝑏 and from 𝑏 to 𝑐, and this produces rich, exploitable algebraic relationships (Blier et al., 2021; Schaul et al., 2015).

Figure 1: Zero-shot scores of ten SF methods and FB, as a percentage of the supervised score of offline TD3 trained on the same replay buffer, averaged over tasks, environments and replay buffers from the Unsupervised RL and ExORL benchmarks (Laskin et al., 2022; Yarats et al., 2021). FB and SFs with Laplacian eigenfunctions achieve zero-shot scores approaching supervised RL.

Suggested strategies for generic zero-shot RL so far have used successor representations (Dayan, 1993), under two forms: successor features (SFs) (Barreto et al., 2017) as in (Borsa et al., 2018; Hansen et al., 2019; Liu & Abbeel, 2021); and forward-backward (FB) representations (Touati & Ollivier, 2021). Both SFs and FB lie in between model-free and model-based RL, by predicting features of future states, or summarizing long-term state-state relationships. Like model-based approaches, they decouple the dynamics of the environment from the reward function. Contrary to world models, they require neither planning at test time nor a generative model of states or trajectories. Yet SFs heavily depend on a choice of basic state features: to get a full zero-shot RL algorithm, a representation learning method must provide those. While SFs have been successfully applied to transfer between tasks, most of the time the basic features were handcrafted or learned using prior knowledge of the task class. Meanwhile, FB is a standalone method with no task prior and good theoretical backing, but testing has been limited to goal-reaching in a few environments. Here:

• We systematically assess SFs and FB for zero-shot RL, including many new models of SF basic features, and improved FB loss functions.
We use 13 tasks from the Unsupervised RL benchmark (Laskin et al., 2021), repeated on several ExORL training replay buffers (Yarats et al., 2021), to assess robustness to the exploration method.

• We systematically study the influence of basic features for SFs, by testing SFs on features from ten RL representation learning methods, such as latent next-state prediction, an inverse curiosity module, contrastive learning, diversity, and various spectral decompositions.

• We expose new mathematical links between SFs, FB, and other representations in RL.

• We discuss the implicit assumptions and limitations behind zero-shot RL approaches.

2. PROBLEM AND NOTATION; DEFINING ZERO-SHOT RL

Let ℳ = (𝑆, 𝐴, 𝑃, 𝛾) be a reward-free Markov decision process (MDP) with state space 𝑆, action space 𝐴, transition probabilities 𝑃(𝑠′|𝑠, 𝑎) from state 𝑠 to 𝑠′ given action 𝑎, and discount factor 0 < 𝛾 < 1 (Sutton & Barto, 2018). If 𝑆 and 𝐴 are finite, 𝑃(𝑠′|𝑠, 𝑎) can be viewed as a stochastic matrix 𝑃_{𝑠𝑎𝑠′} ∈ ℝ^{(|𝑆|×|𝐴|)×|𝑆|}; in general, for each (𝑠, 𝑎) ∈ 𝑆 × 𝐴, 𝑃(d𝑠′|𝑠, 𝑎) is a probability measure on 𝑠′ ∈ 𝑆. The notation 𝑃(d𝑠′|𝑠, 𝑎) covers both cases. Given (𝑠_0, 𝑎_0) ∈ 𝑆 × 𝐴 and a policy 𝜋 : 𝑆 → Prob(𝐴), we denote by Pr(·|𝑠_0, 𝑎_0, 𝜋) and E[·|𝑠_0, 𝑎_0, 𝜋] the probabilities and expectations under state-action sequences (𝑠_𝑡, 𝑎_𝑡)_{𝑡≥0} starting at (𝑠_0, 𝑎_0) and following policy 𝜋 in the environment, defined by sampling 𝑠_𝑡 ∼ 𝑃(d𝑠_𝑡|𝑠_{𝑡−1}, 𝑎_{𝑡−1}) and 𝑎_𝑡 ∼ 𝜋(d𝑎_𝑡|𝑠_𝑡). We define 𝑃_𝜋(d𝑠′, d𝑎′|𝑠, 𝑎) := 𝑃(d𝑠′|𝑠, 𝑎)𝜋(d𝑎′|𝑠′) and 𝑃_𝜋(d𝑠′|𝑠) := ∫ 𝑃(d𝑠′|𝑠, 𝑎)𝜋(d𝑎|𝑠), the state-action and state transition probabilities induced by 𝜋. Given a reward function 𝑟 : 𝑆 → ℝ, the 𝑄-function of 𝜋 for 𝑟 is 𝑄^𝜋_𝑟(𝑠_0, 𝑎_0) := ∑_{𝑡≥0} 𝛾^𝑡 E[𝑟(𝑠_{𝑡+1}) | 𝑠_0, 𝑎_0, 𝜋]. For simplicity, we assume the reward 𝑟 depends only on the next state 𝑠_{𝑡+1} instead of on the full triplet (𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}), but this is not essential. We focus on offline unsupervised RL, where the agent cannot interact with the environment. The agent only has access to a static dataset of logged reward-free transitions in the environment, 𝒟 = {(𝑠_𝑖, 𝑎_𝑖, 𝑠′_𝑖)}_{𝑖∈ℐ} with 𝑠′_𝑖 ∼ 𝑃(d𝑠′_𝑖|𝑠_𝑖, 𝑎_𝑖). These can come from any exploration method or methods. The offline setting disentangles the effects of the exploration method from those of representation and policy learning: we test each zero-shot method on several training datasets from several exploration methods. We denote by 𝜌(d𝑠) and 𝜌(d𝑠, d𝑎) the (unknown) marginal distributions of states and state-actions in the dataset 𝒟.
We use both E_{𝑠∼𝒟}[·] and E_{𝑠∼𝜌}[·] for expectations under the training distribution.

Zero-shot RL: problem statement. The goal of zero-shot RL is to compute a compact representation ℰ of the environment by observing samples of reward-free transitions (𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}) in this environment. Once a reward function is specified later, the agent must use ℰ to immediately produce a good policy, via only elementary computations without any further planning or learning. Ideally, for any downstream task, the performance of the returned policy should be close to that of a supervised RL baseline trained on the same dataset labeled with the rewards for that task. Reward functions may be specified at test time either as a relatively small set of reward samples (𝑠_𝑖, 𝑟_𝑖), or as an explicit function 𝑠 ↦ 𝑟(𝑠) (such as 1 at a known goal state and 0 elsewhere). The method is few-shot, zero-planning in the first case, and truly zero-shot in the second.

3. RELATED WORK

Zero-shot RL requires unsupervised learning and the absence of any planning or fine-tuning at test time. The proposed strategies for zero-shot RL discussed in Section 1 ultimately derive from successor representations (Dayan, 1993) in finite spaces. In continuous spaces, starting with a finite number of features 𝜙, successor features can be used to produce policies within a family of tasks directly related to 𝜙 (Barreto et al., 2017; Borsa et al., 2018; Zhang et al., 2017; Grimm et al., 2019), often using hand-crafted 𝜙, or learning the 𝜙 that best linearize training rewards. VISR (Hansen et al., 2019) and its successor APS (Liu & Abbeel, 2021) use SFs with 𝜙 built automatically online via diversity criteria (Eysenbach et al., 2018; Gregor et al., 2016). We include APS among our baselines, as well as many new criteria to build 𝜙 automatically. Successor measures (Blier et al., 2021) avoid the need for 𝜙 by directly learning models of the distribution of future states: doing this for various policies yields a candidate zero-shot RL method, forward-backward (FB) representations (Touati & Ollivier, 2021), which has been tested for goal-reaching in a few environments with discrete actions. FB uses a low-rank model of long-term state-state relationships, reminiscent of the state-goal factorization from Schaul et al. (2015). Model-based RL (surveyed in Moerland et al. (2020)) misses the zero-planning requirement of zero-shot RL. Still, learned models of the transitions between states can be used jointly with SFs to provide zero-shot methods (the Trans, Latent, and LRA-P methods below). Goal-oriented and multitask RL has a long history (e.g., Foster & Dayan, 2002; Sutton et al., 2011; da Silva et al., 2012; Schaul et al., 2015; Andrychowicz et al., 2017). A parametric family of tasks must be defined in advance (e.g., reaching arbitrary goal states). New rewards cannot be set a posteriori: for example, a goal-state-oriented method cannot handle dense rewards.
Zero-shot task transfer methods learn on a set of tasks and can transfer to related tasks only (e.g., Oh et al., 2017; Sohn et al., 2018); this can be used, e.g., for sim-to-real transfer (Genc et al., 2020) or slight environment changes, which is not covered here. Instead, we aim at not having any predefined family of tasks. Unsupervised skill and option discovery methods, based for instance on diversity (Eysenbach et al., 2018; Gregor et al., 2016) or eigenoptions (Machado et al., 2017), can learn a variety of behaviors without rewards. Downstream tasks require learning a hierarchical controller to combine the right skills or options for each task. Directly using unmodified skills has limited performance without heavy finetuning (Eysenbach et al., 2018). Still, these methods can speed up downstream learning. The unsupervised aspect of some of these methods (including DIAYN and APS) has been disputed, because training still uses end-of-trajectory signals, which are directly correlated to the downstream task in some common environments: without this signal, results drop sharply (Laskin et al., 2022).

4. SUCCESSOR REPRESENTATIONS AND ZERO-SHOT RL

For a finite MDP, the successor representation (SR) (Dayan, 1993) 𝑀^𝜋(𝑠_0, 𝑎_0) of a state-action pair (𝑠_0, 𝑎_0) under a policy 𝜋 is defined as the discounted sum of future occurrences of each state:

𝑀^𝜋(𝑠_0, 𝑎_0, 𝑠) := E[∑_{𝑡≥0} 𝛾^𝑡 1_{𝑠_{𝑡+1}=𝑠} | 𝑠_0, 𝑎_0, 𝜋]  ∀𝑠 ∈ 𝑆. (1)

In matrix form, SRs can be written as 𝑀^𝜋 = 𝑃 ∑_{𝑡≥0} 𝛾^𝑡 𝑃_𝜋^𝑡 = 𝑃(Id − 𝛾𝑃_𝜋)^{−1}, where 𝑃_𝜋 is the state transition matrix. 𝑀^𝜋 satisfies the matrix Bellman equation 𝑀^𝜋 = 𝑃 + 𝛾𝑃_𝜋𝑀^𝜋. Importantly, SRs disentangle the dynamics of the MDP from the reward function: for any reward 𝑟 and policy 𝜋, the 𝑄-function can be expressed linearly as 𝑄^𝜋_𝑟 = 𝑀^𝜋𝑟.

Successor features and successor measures. Successor features (SFs) (Barreto et al., 2017) extend SRs to continuous MDPs by first assuming we are given a basic feature map 𝜙 : 𝑆 → ℝ^𝑑 that embeds states into 𝑑-dimensional space, and defining the expected discounted sum of future state features:

𝜓^𝜋(𝑠_0, 𝑎_0) := E[∑_{𝑡≥0} 𝛾^𝑡 𝜙(𝑠_{𝑡+1}) | 𝑠_0, 𝑎_0, 𝜋]. (2)

SFs were introduced to make SRs compatible with function approximation. For a finite MDP, the original definition (1) is recovered by letting 𝜙 be a one-hot state encoding into ℝ^{|𝑆|}. Alternatively, successor measures (SMs) (Blier et al., 2021) extend SRs to continuous spaces by treating the distribution of future visited states as a measure 𝑀^𝜋 over the state space 𝑆:

𝑀^𝜋(𝑠_0, 𝑎_0, 𝑋) := ∑_{𝑡≥0} 𝛾^𝑡 Pr(𝑠_{𝑡+1} ∈ 𝑋 | 𝑠_0, 𝑎_0, 𝜋)  ∀𝑋 ⊂ 𝑆. (3)

SFs and SMs are related: by construction, 𝜓^𝜋(𝑠_0, 𝑎_0) = ∫_{𝑠′} 𝑀^𝜋(𝑠_0, 𝑎_0, d𝑠′) 𝜙(𝑠′).

Zero-shot RL from successor features and forward-backward representations. Successor representations provide a generic framework for zero-shot RL, by learning to represent the relationship between reward functions and 𝑄-functions, as encoded in 𝑀^𝜋.
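As a sanity check on the finite-MDP identities above, here is a minimal numpy sketch on an entirely hypothetical toy MDP, verifying the closed form 𝑀^𝜋 = 𝑃(Id − 𝛾𝑃_𝜋)^{−1}, the matrix Bellman equation, and the linearity 𝑄^𝜋_𝑟 = 𝑀^𝜋𝑟:

```python
import numpy as np

# Toy finite MDP (hypothetical, for illustration only).
S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, s'] transition probabilities
pi = rng.dirichlet(np.ones(A), size=S)          # pi[s, a] stochastic policy

# State transition matrix induced by pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum("sa,sax->sx", pi, P)

# Successor representation M^pi = P (Id - gamma P_pi)^{-1}, shape (S, A, S)
M = np.einsum("sax,xy->say", P, np.linalg.inv(np.eye(S) - gamma * P_pi))

# Matrix Bellman equation: M^pi = P + gamma P_pi M^pi
M_bellman = P + gamma * np.einsum("sax,xb,xby->say", P, pi, M)
assert np.allclose(M, M_bellman)

# Q-functions are linear in the reward: Q^pi_r = M^pi r (reward on next state)
r = rng.standard_normal(S)
Q = M @ r                                        # shape (S, A)
assert np.allclose(Q, P @ r + gamma * np.einsum("sax,xb,xb->sa", P, pi, Q))
```

The same linear system solve yields the 𝑄-function of any reward once 𝑀^𝜋 is known, which is the disentangling property exploited throughout this section.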
Given a basic feature map 𝜙 : 𝑆 → ℝ^𝑑 to be learned via another criterion, universal SFs (Borsa et al., 2018) learn the successor features of a particular family of policies 𝜋_𝑧 for 𝑧 ∈ ℝ^𝑑:

𝜓(𝑠_0, 𝑎_0, 𝑧) = E[∑_{𝑡≥0} 𝛾^𝑡 𝜙(𝑠_{𝑡+1}) | 𝑠_0, 𝑎_0, 𝜋_𝑧],  𝜋_𝑧(𝑠) := arg max_𝑎 𝜓(𝑠, 𝑎, 𝑧)^⊤𝑧. (4)

Once a reward function 𝑟 is revealed, we use a few reward samples or explicit knowledge of the function 𝑟 to perform a linear regression of 𝑟 onto the features 𝜙. Namely, we estimate 𝑧_𝑟 := arg min_𝑧 E_{𝑠∼𝜌}[(𝑟(𝑠) − 𝜙(𝑠)^⊤𝑧)^2] = E_𝜌[𝜙𝜙^⊤]^{−1} E_𝜌[𝜙𝑟]. Then we return the policy 𝜋_{𝑧_𝑟}. This policy is guaranteed to be optimal for all rewards in the linear span of the features 𝜙:

Theorem 1 (Borsa et al. (2018)). Assume that (4) holds. Assume there exists a weight vector 𝑤 ∈ ℝ^𝑑 such that 𝑟(𝑠) = 𝜙(𝑠)^⊤𝑤 for all 𝑠 ∈ 𝑆. Then 𝑧_𝑟 = 𝑤, and 𝜋_{𝑧_𝑟} is the optimal policy for reward 𝑟.

Forward-backward (FB) representations (Touati & Ollivier, 2021) apply a similar idea to a finite-rank model of successor measures. They look for representations 𝐹 : 𝑆 × 𝐴 × ℝ^𝑑 → ℝ^𝑑 and 𝐵 : 𝑆 → ℝ^𝑑 such that the long-term transition probabilities 𝑀^{𝜋_𝑧} in (3) decompose as

𝑀^{𝜋_𝑧}(𝑠_0, 𝑎_0, d𝑠′) ≈ 𝐹(𝑠_0, 𝑎_0, 𝑧)^⊤𝐵(𝑠′) 𝜌(d𝑠′),  𝜋_𝑧(𝑠) := arg max_𝑎 𝐹(𝑠, 𝑎, 𝑧)^⊤𝑧. (5)

In a finite space, the first equation rewrites as the matrix decomposition 𝑀^{𝜋_𝑧} = 𝐹_𝑧^⊤𝐵 diag(𝜌). Once a reward function 𝑟 is revealed, we estimate 𝑧_𝑟 := E_{𝑠∼𝜌}[𝑟(𝑠)𝐵(𝑠)] from a few reward samples or from explicit knowledge of the function 𝑟 (e.g., 𝑧_𝑟 = 𝐵(𝑠) to reach state 𝑠). Then we return the policy 𝜋_{𝑧_𝑟}. If the approximation (5) holds exactly, this policy is guaranteed to be optimal for any reward function:

Theorem 2 (Touati & Ollivier (2021)). Assume that (5) holds. Then for any reward function 𝑟, the policy 𝜋_{𝑧_𝑟} is optimal for 𝑟, with optimal 𝑄-function 𝑄^⋆_𝑟(𝑠, 𝑎) = 𝐹(𝑠, 𝑎, 𝑧_𝑟)^⊤𝑧_𝑟.

For completeness, we sketch the proofs of Theorems 1-2 in Appendix A.
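In both schemes, the test-time task-inference step reduces to a few lines of linear algebra. A minimal numpy sketch (the function names `infer_z_sf` and `infer_z_fb` are ours, for illustration):

```python
import numpy as np

def infer_z_sf(phi_samples, r_samples, reg=1e-6):
    """SF task encoding by least squares: z_r = E[phi phi^T]^{-1} E[phi r].
    phi_samples: (n, d) features at states s ~ rho; r_samples: (n,) rewards."""
    n, d = phi_samples.shape
    cov = phi_samples.T @ phi_samples / n
    xcov = phi_samples.T @ r_samples / n
    return np.linalg.solve(cov + reg * np.eye(d), xcov)  # small ridge for stability

def infer_z_fb(B_samples, r_samples):
    """FB task encoding: z_r = E_{s ~ rho}[r(s) B(s)]."""
    return (r_samples[:, None] * B_samples).mean(axis=0)

# Check consistent with Theorem 1: if r is exactly linear in phi,
# SF inference recovers the true weight vector w.
rng = np.random.default_rng(0)
phi = rng.standard_normal((1000, 4))
w = np.array([1.0, -2.0, 0.5, 0.0])
z = infer_z_sf(phi, phi @ w)
assert np.allclose(z, w, atol=1e-2)
```

This is the entirety of the extra test-time computation: no planning, no gradient steps, just one regression (SF) or one empirical average (FB).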
Importantly, both theorems are compatible with approximation: approximate solutions provide approximately optimal policies.

Connections between SFs and FB. A first difference between SFs and FB is that SFs must be provided with basic features 𝜙. The best 𝜙 is such that the reward functions of the downstream tasks are linear in 𝜙. But for unsupervised training without prior task knowledge, an external criterion is needed to learn 𝜙. We test a series of such criteria below. In contrast, FB uses a single criterion, avoiding the need for state featurization by learning a model of state occupancy. Second, SFs only cover rewards in the linear span of 𝜙, while FB apparently covers any reward. But this difference is not as stark as it looks: exactly solving the FB equation (5) in continuous spaces requires 𝑑 = ∞, and for finite 𝑑, the policies will only be optimal for rewards in the linear span of 𝐵 (Touati & Ollivier, 2021). Thus, in both cases, policies are exactly optimal only for a 𝑑-dimensional family of rewards. Still, FB can use an arbitrarily large 𝑑 without any additional input or criterion. FB representations are related to successor features: the FB definition (5) implies that 𝜓(𝑠, 𝑎, 𝑧) := 𝐹(𝑠, 𝑎, 𝑧) are the successor features of 𝜙(𝑠) := (E_𝜌 𝐵𝐵^⊤)^{−1}𝐵(𝑠). This follows from multiplying (3) and (5) by 𝐵^⊤(E_𝜌 𝐵𝐵^⊤)^{−1} on the right, and integrating over 𝑠′ ∼ 𝜌. Thus, a posteriori, FB can be used to produce both the 𝜙 and 𝜓 of SFs, although training is different. This connection between FB and SFs is one-directional: (5) is a stronger condition. In particular, 𝐹 = 𝐵 = 0 is not a solution: contrary to 𝜓 = 𝜙 = 0 in (4), there is no collapse. No additional criterion to train 𝜙 is required: 𝐹 and 𝐵 are trained jointly to provide the best rank-𝑑 approximation of the successor measures 𝑀^𝜋. This summarizes an environment by selecting the features that best describe the relationship 𝑄^𝜋_𝑟 = 𝑀^𝜋𝑟 between rewards and 𝑄-functions.

5. ALGORITHMS FOR SUCCESSOR FEATURES AND FB REPRESENTATIONS

We now describe more precisely the algorithms used in our experiments. The losses used to train 𝜓 in SFs, and 𝐹, 𝐵 in FB, are described in Sections 5.1 and 5.2 respectively. To obtain a full zero-shot RL algorithm, SFs must specify the basic features 𝜙. Any representation learning method can be used for 𝜙. We use ten possible choices (Section 5.3) based on existing or new representations for RL: random features as a baseline, autoencoders, next state and latent next state transition models, inverse curiosity module, the diversity criterion of APS, contrastive learning, and finally, several spectral decompositions of the transition matrix or its associated Laplacian. Both SFs and FB define policies as an argmax of 𝜓(𝑠, 𝑎, 𝑧) ⊤ 𝑧 or 𝐹 (𝑠, 𝑎, 𝑧) ⊤ 𝑧 over actions 𝑎. With continuous actions, the argmax cannot be computed exactly. We train an auxiliary policy network 𝜋(𝑠, 𝑧) to approximate this argmax, using the same standard method for SFs and FB (Appendix G.4).

5.1. LEARNING THE SUCCESSOR FEATURES 𝜓

The successor features 𝜓 satisfy the ℝ^𝑑-valued Bellman equation 𝜓^𝜋 = 𝑃𝜙 + 𝛾𝑃_𝜋𝜓^𝜋, the collection of ordinary Bellman equations for each component of 𝜙. The 𝑃 in front of 𝜙 comes from using 𝜙(𝑠_{𝑡+1}) rather than 𝜙(𝑠_𝑡) in (2). Therefore, we can train 𝜓(𝑠, 𝑎, 𝑧) for each 𝑧 by minimizing the Bellman residual

‖𝜓(𝑠_𝑡, 𝑎_𝑡, 𝑧) − 𝜙(𝑠_{𝑡+1}) − 𝛾ψ̄(𝑠_{𝑡+1}, 𝜋_𝑧(𝑠_{𝑡+1}), 𝑧)‖^2

where ψ̄ is a non-trainable target version of 𝜓, as in parametric 𝑄-learning. This requires sampling a transition (𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}) from the dataset and choosing 𝑧; we sample random values of 𝑧 as described in Appendix G.3. This is the loss used in Borsa et al. (2018). But it can be improved, since we do not use the full vector 𝜓(𝑠, 𝑎, 𝑧): only 𝜓(𝑠, 𝑎, 𝑧)^⊤𝑧 is needed for the policies. Therefore, as in Liu & Abbeel (2021), instead of the vector-valued Bellman residual above, we just use

ℒ(𝜓) := E_{(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})∼𝜌} [(𝜓(𝑠_𝑡, 𝑎_𝑡, 𝑧)^⊤𝑧 − 𝜙(𝑠_{𝑡+1})^⊤𝑧 − 𝛾ψ̄(𝑠_{𝑡+1}, 𝜋_𝑧(𝑠_{𝑡+1}), 𝑧)^⊤𝑧)^2] (6)

for each 𝑧. This trains 𝜓(·, 𝑧)^⊤𝑧 as the 𝑄-function of reward 𝜙^⊤𝑧, the only case needed, whereas training the full vector 𝜓(·, 𝑧) amounts to training the 𝑄-functions of each policy 𝜋_𝑧 for all rewards 𝜙^⊤𝑧′ with 𝑧′ ∈ ℝ^𝑑, including 𝑧′ ≠ 𝑧. We have found this improves performance.
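The scalarized loss (6) can be sketched in a few lines of numpy on hypothetical batched feature arrays (the function name `sf_loss` is ours; in practice the features come from neural networks and the target ψ̄ receives no gradient):

```python
import numpy as np

def sf_loss(psi_sa, phi_next, psi_next_tgt, z, gamma):
    """Scalarized SF Bellman residual (6): fit psi(s_t, a_t, z)^T z to the TD
    target phi(s_{t+1})^T z + gamma * psibar(s_{t+1}, pi_z(s_{t+1}), z)^T z.
    All feature arguments are (batch, d) arrays; psi_next_tgt plays the role
    of the frozen target network."""
    residual = psi_sa @ z - (phi_next @ z + gamma * (psi_next_tgt @ z))
    return float(np.mean(residual ** 2))

# Sanity check: a psi satisfying the vector-valued Bellman identity
# psi = phi(s_{t+1}) + gamma * psibar also zeroes the scalarized loss.
rng = np.random.default_rng(0)
batch, d, gamma = 32, 8, 0.98
z = rng.standard_normal(d)
phi_next = rng.standard_normal((batch, d))
psi_next_tgt = rng.standard_normal((batch, d))
psi_consistent = phi_next + gamma * psi_next_tgt
assert np.isclose(sf_loss(psi_consistent, phi_next, psi_next_tgt, z, gamma), 0.0)
```

The converse does not hold: the scalar loss constrains only the projection onto 𝑧, which is exactly why it is cheaper than the vector-valued residual.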

5.2. LEARNING FB REPRESENTATIONS: THE FB TRAINING LOSS

The successor measure 𝑀^𝜋 satisfies a Bellman-like equation 𝑀^𝜋 = 𝑃 + 𝛾𝑃_𝜋𝑀^𝜋, as matrices in the finite case and as measures in the general case (Blier et al., 2021). We can learn FB by iteratively minimizing the Bellman residual of the parametric model 𝑀 = 𝐹^⊤𝐵𝜌. Using a suitable norm ‖·‖_𝜌 for the Bellman residual (Appendix B) leads to a loss expressed as expectations over the dataset:

ℒ(𝐹, 𝐵) := ‖𝐹_𝑧^⊤𝐵𝜌 − (𝑃 + 𝛾𝑃_{𝜋_𝑧}F̄_𝑧^⊤B̄𝜌)‖^2_𝜌 (7)

= E_{(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})∼𝜌, 𝑠′∼𝜌} [(𝐹(𝑠_𝑡, 𝑎_𝑡, 𝑧)^⊤𝐵(𝑠′) − 𝛾F̄(𝑠_{𝑡+1}, 𝜋_𝑧(𝑠_{𝑡+1}), 𝑧)^⊤B̄(𝑠′))^2] − 2 E_{(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})∼𝜌} [𝐹(𝑠_𝑡, 𝑎_𝑡, 𝑧)^⊤𝐵(𝑠_{𝑡+1})] + Const (8)

where the constant term does not depend on 𝐹 and 𝐵, and where, as usual, F̄ and B̄ are non-trainable target versions of 𝐹 and 𝐵 whose parameters are updated as a slow-moving average of those of 𝐹 and 𝐵. Appendix B derives this loss, with pseudocode in Appendix L. Contrary to Touati & Ollivier (2021), the last term involves 𝐵(𝑠_{𝑡+1}) instead of 𝐵(𝑠_𝑡), because we use 𝑠_{𝑡+1} instead of 𝑠_𝑡 in the successor measures (3). We sample random values of 𝑧 as described in Appendix G.3. We include an auxiliary loss (Appendix B) to normalize the covariance of 𝐵, E_𝜌 𝐵𝐵^⊤ ≈ Id, as in Touati & Ollivier (2021) (otherwise one can, e.g., scale 𝐹 up and 𝐵 down, since only 𝐹^⊤𝐵 is fixed). As with SFs above, only 𝐹(·, 𝑧)^⊤𝑧 is needed for the policies, while the loss above on the vector 𝐹 amounts to training 𝐹(·, 𝑧)^⊤𝑧′ for all pairs (𝑧, 𝑧′). The full loss is needed for the joint training of 𝐹 and 𝐵, but we include an auxiliary loss ℒ′(𝐹) to focus training on the diagonal 𝑧′ = 𝑧. It is obtained by multiplying the Bellman gap in ℒ by 𝐵^⊤(𝐵𝜌𝐵^⊤)^{−1}𝑧 on the right, to make 𝐹(·, 𝑧)^⊤𝑧 appear:

ℒ′(𝐹) := E_{(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})∼𝜌} [(𝐹(𝑠_𝑡, 𝑎_𝑡, 𝑧)^⊤𝑧 − 𝐵(𝑠_{𝑡+1})^⊤(E_𝜌 𝐵𝐵^⊤)^{−1}𝑧 − 𝛾F̄(𝑠_{𝑡+1}, 𝜋_𝑧(𝑠_{𝑡+1}), 𝑧)^⊤𝑧)^2]. (9)

This trains 𝐹(·, 𝑧)^⊤𝑧 as the 𝑄-function for reward 𝐵^⊤(E_𝜌 𝐵𝐵^⊤)^{−1}𝑧.
Though ℒ = 0 implies ℒ ′ = 0, adding ℒ ′ reduces the error on the part used for policies. This departs from Touati & Ollivier (2021) .
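A minimal numpy sketch of the sample-based FB loss (8), up to its additive constant, on hypothetical batched arrays (the function name `fb_loss` is ours; each transition is paired with a batch of independent states 𝑠′ ∼ 𝜌 to estimate the double expectation):

```python
import numpy as np

def fb_loss(F_sa, B_next, F_next_tgt, B_ind_tgt, B_ind, gamma):
    """Sample-based FB loss (8), dropping the constant term.
    F_sa       : F(s_t, a_t, z)                   (batch, d)
    B_next     : B(s_{t+1})                       (batch, d)
    F_next_tgt : Fbar(s_{t+1}, pi_z(s_{t+1}), z)  (batch, d), frozen target
    B_ind, B_ind_tgt : online and target B at independent states s' ~ rho (n, d)
    """
    # Squared term over all (transition, s') pairs
    sq = np.mean((F_sa @ B_ind.T - gamma * (F_next_tgt @ B_ind_tgt.T)) ** 2)
    # Cross term: -2 E[F(s_t, a_t, z)^T B(s_{t+1})]
    cross = -2.0 * np.mean(np.sum(F_sa * B_next, axis=1))
    return float(sq + cross)

# Cross-check the vectorized squared term against an explicit double loop.
rng = np.random.default_rng(0)
b, n, d, gamma = 4, 5, 3, 0.98
F_sa, B_next, F_tgt = (rng.standard_normal((b, d)) for _ in range(3))
B_ind, B_ind_tgt = (rng.standard_normal((n, d)) for _ in range(2))
val = fb_loss(F_sa, B_next, F_tgt, B_ind_tgt, B_ind, gamma)
sq = np.mean([(F_sa[i] @ B_ind[j] - gamma * F_tgt[i] @ B_ind_tgt[j]) ** 2
              for i in range(b) for j in range(n)])
assert np.isclose(val, sq - 2.0 * np.mean(np.sum(F_sa * B_next, axis=1)))
```

In an actual implementation 𝐹 and 𝐵 are networks, the independent 𝑠′ batch is simply a shuffled copy of the transition batch, and the covariance-normalization and ℒ′ terms are added on top.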

5.3. LEARNING BASIC FEATURES 𝜙 FOR SUCCESSOR FEATURES

SFs must be provided with basic state features 𝜙. Any representation learning method can be used to supply 𝜙. We focus on prominent RL representation learning baselines, and on those used in previous zero-shot RL candidates such as APS. We now describe the precise learning objective for each.

Random Features (Rand). We use a non-trainable, randomly initialized network as features.

Autoencoder (AEnc). We learn a decoder 𝑓 : ℝ^𝑑 → 𝑆 to recover the state from its representation 𝜙: min_{𝑓,𝜙} E_{𝑠∼𝒟}[(𝑓(𝜙(𝑠)) − 𝑠)^2].

Inverse Curiosity Module (ICM). ICM aims at extracting the controllable aspects of the environment (Pathak et al., 2017). The idea is to train an inverse dynamics model 𝑔 : ℝ^𝑑 × ℝ^𝑑 → 𝐴 to predict the action used for a transition between two consecutive states. We use the loss min_{𝑔,𝜙} E_{(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})∼𝒟}[‖𝑔(𝜙(𝑠_𝑡), 𝜙(𝑠_{𝑡+1})) − 𝑎_𝑡‖^2].

Transition model (Trans). This is a one-step forward dynamics model 𝑓 : ℝ^𝑑 × 𝐴 → 𝑆 that predicts the next state from the current state representation: min_{𝑓,𝜙} E_{(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})∼𝒟}[(𝑓(𝜙(𝑠_𝑡), 𝑎_𝑡) − 𝑠_{𝑡+1})^2].

Latent transition model (Latent). This is similar to the transition model, but instead of predicting the next state, it predicts its representation: min_{𝑓,𝜙} E_{(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})∼𝒟}[(𝑓(𝜙(𝑠_𝑡), 𝑎_𝑡) − 𝜙(𝑠_{𝑡+1}))^2]. A clear failure case of this loss is when all states are mapped to the same representation. To avoid this collapse, we compute 𝜙(𝑠_{𝑡+1}) using a non-trainable version of 𝜙, with parameters equal to a slow-moving average of the parameters of 𝜙, similarly to BYOL (Grill et al., 2020).

Diversity methods (APS). VISR (Hansen et al., 2019) and its successor APS (Liu & Abbeel, 2021) tackle zero-shot RL using SFs with features 𝜙 built online from a diversity criterion. This criterion maximizes the mutual information between a policy parameter and the features of the states visited by a policy using that parameter (Eysenbach et al., 2018; Gregor et al., 2016).
VISR and APS use, respectively, a variational estimator and a nearest-neighbor estimator for the mutual information. We directly use the code provided for APS, and refer to Liu & Abbeel (2021) for the details. Contrary to the other methods, APS is not offline: it needs to be trained on its own replay buffer.

Laplacian Eigenfunctions (Lap). Wu et al. (2018) consider the symmetrized MDP graph Laplacian induced by an exploratory policy 𝜋, defined as ℒ = Id − ½(𝑃_𝜋 diag(𝜌)^{−1} + diag(𝜌)^{−1}𝑃_𝜋^⊤). They propose to learn the eigenfunctions of ℒ via the spectral graph drawing objective (Koren, 2003):

min_𝜙 E_{(𝑠_𝑡, 𝑠_{𝑡+1})∼𝒟}[‖𝜙(𝑠_𝑡) − 𝜙(𝑠_{𝑡+1})‖^2] + 𝜆 E_{𝑠∼𝒟, 𝑠′∼𝒟}[(𝜙(𝑠)^⊤𝜙(𝑠′))^2 − ‖𝜙(𝑠)‖^2_2 − ‖𝜙(𝑠′)‖^2_2] (14)

where the second term is an orthonormality regularization ensuring that E_{𝑠∼𝜌}[𝜙(𝑠)𝜙(𝑠)^⊤] ≈ Id, and 𝜆 > 0 is the regularization weight. This objective is implicitly contrastive, pushing the features of 𝑠_𝑡 and 𝑠_{𝑡+1} closer while keeping features apart overall. Such eigenfunctions have long been argued to play a key role in RL (Mahadevan & Maggioni, 2007; Machado et al., 2017).

Low-Rank Approximation of 𝑃 (LRA-P). We learn features by estimating a low-rank model of the transition probability densities: 𝑃(d𝑠′|𝑠, 𝑎) ≈ 𝜒(𝑠, 𝑎)^⊤𝜇(𝑠′) 𝜌(d𝑠′). Knowing 𝜌 is not needed: the corresponding loss on 𝜒^⊤𝜇 − 𝑃/𝜌 is readily expressed as expectations over the dataset,

min_{𝜒,𝜇} E_{(𝑠_𝑡, 𝑎_𝑡)∼𝜌, 𝑠′∼𝜌}[(𝜒(𝑠_𝑡, 𝑎_𝑡)^⊤𝜇(𝑠′) − 𝑃(d𝑠′|𝑠_𝑡, 𝑎_𝑡)/𝜌(d𝑠′))^2] (15)

= E_{(𝑠_𝑡, 𝑎_𝑡)∼𝜌, 𝑠′∼𝜌}[(𝜒(𝑠_𝑡, 𝑎_𝑡)^⊤𝜇(𝑠′))^2] − 2 E_{(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})∼𝜌}[𝜒(𝑠_𝑡, 𝑎_𝑡)^⊤𝜇(𝑠_{𝑡+1})] + Const. (16)

We normalize E_𝜌[𝜇𝜇^⊤] ≈ Id with the same loss used for 𝐵. Then we use SFs with 𝜙 := 𝜇. If the model 𝑃 = 𝜒^⊤𝜇𝜌 is exact, this provides exact optimal policies for any reward (Appendix E, Thm. 3). This loss is implicitly contrastive: it compares samples 𝑠_{𝑡+1} to independent samples 𝑠′ from 𝜌. It is an asymmetric extension of the Laplacian loss (14) (Appendix F).
The loss (16) is also a special case of the FB loss (8), obtained by setting 𝛾 = 0, omitting 𝑧, and substituting (𝜒, 𝜇) for (𝐹, 𝐵). Indeed, FB learns a finite-rank model of 𝑃(Id − 𝛾𝑃_𝜋)^{−1}, which equals 𝑃 when 𝛾 = 0. A related loss is introduced in Ren et al. (2022), but it involves a second, unspecified, arbitrary probability distribution 𝑝, which must cover the whole state space and whose analytic expression must be known. It is unclear how to set a suitable 𝑝 in general.

Contrastive Learning (CL). CL methods learn representations by pushing positive pairs (similar states) closer together while keeping negative pairs apart. Here, two states are considered similar if they lie close together on the same trajectory. We use a SimCLR-like objective (Chen et al., 2020):

min_{𝜒,𝜙} −E_{𝑘∼Geom(1−𝛾_CL), (𝑠_𝑡, 𝑠_{𝑡+𝑘})∼𝒟} [log (exp(cosine(𝜒(𝑠_𝑡), 𝜙(𝑠_{𝑡+𝑘}))) / E_{𝑠′∼𝒟} exp(cosine(𝜒(𝑠_𝑡), 𝜙(𝑠′))))] (17)

where 𝑠_{𝑡+𝑘} is the state encountered at step 𝑡+𝑘 along the subtrajectory that starts at 𝑠_𝑡, 𝑘 is sampled from a geometric distribution with parameter 1 − 𝛾_CL, and cosine(𝑢, 𝑣) := 𝑢^⊤𝑣 / (‖𝑢‖_2‖𝑣‖_2) for all 𝑢, 𝑣 ∈ ℝ^𝑑 is the cosine similarity. Here 𝛾_CL ∈ [0, 1) is a parameter not necessarily set to the MDP's discount factor 𝛾. CL requires a dataset made of full trajectories instead of isolated transitions. CL is tightly related to the spectral decomposition of the successor measure ∑_𝑡 𝛾_CL^𝑡 𝑃_𝜋^{𝑡+1}, where 𝜋 is the behavior policy generating the dataset trajectories. Precisely, assuming that 𝜒 and 𝜙 are centered with unit norm, and expanding the log and exp at second order, the loss (17) becomes

(17) ≈ ½ E_{𝑠∼𝒟, 𝑠′∼𝒟}[(𝜒(𝑠)^⊤𝜙(𝑠′))^2] − E_{𝑘∼Geom(1−𝛾_CL), (𝑠_𝑡, 𝑠_{𝑡+𝑘})∼𝒟}[𝜒(𝑠_𝑡)^⊤𝜙(𝑠_{𝑡+𝑘})] (18)

(compare (16)). Now, the law of 𝑠_{𝑡+𝑘} given 𝑠_𝑡 is given by the stochastic matrix (1−𝛾_CL) ∑_𝑡 𝛾_CL^𝑡 𝑃_𝜋^{𝑡+1}, the rescaled successor measure of 𝜋. Then one finds that (18) is minimized when 𝜒 and 𝜙 provide the singular value decomposition of this matrix in 𝐿^2(𝜌) norm (Appendix C).
Formal links between contrastive learning and spectral methods can be found in Tian (2022); Balestriero & LeCun (2022) .
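The CL objective (17) can be sketched directly in numpy on hypothetical feature batches (the function name `cl_loss` is ours; the denominator expectation over 𝑠′ ∼ 𝜌 is estimated from a batch of independent "negative" states):

```python
import numpy as np

def cosine(u, v):
    """All pairwise cosine similarities between rows of u and rows of v."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return u @ v.T

def cl_loss(chi_t, phi_pos, phi_neg):
    """SimCLR-like objective (17): anchors chi(s_t) paired row-wise with
    positives phi(s_{t+k}); phi_neg holds features of independent states."""
    pos = np.diag(cosine(chi_t, phi_pos))                      # positive similarities
    log_denom = np.log(np.mean(np.exp(cosine(chi_t, phi_neg)), axis=1))
    return float(np.mean(log_denom - pos))

rng = np.random.default_rng(0)
batch, d = 16, 4
chi = rng.standard_normal((batch, d))
neg = rng.standard_normal((64, d))
# Geometric horizon: k >= 1 with mean 1 / (1 - gamma_CL)
k = int(rng.geometric(1.0 - 0.9))
# Perfectly aligned positives give a lower loss than random positives.
assert cl_loss(chi, chi.copy(), neg) < cl_loss(chi, rng.standard_normal((batch, d)), neg)
```

In practice the positive 𝜙(𝑠_{𝑡+𝑘}) is gathered by sampling 𝑘 per anchor and indexing into the stored trajectory, which is why CL requires full trajectories rather than isolated transitions.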

Low-Rank Approximation of SR (LRA-SR).

The CL method implicitly factorizes the successor measure of the exploration policy in a Monte Carlo fashion, by sampling pairs (𝑠_𝑡, 𝑠_{𝑡+𝑘}) on the same trajectory. This may suffer from high variance. To mitigate this, we propose to factorize this successor measure by temporal difference learning instead of Monte Carlo. This is achieved with an FB-like loss (8), except that we drop the policies 𝜋_𝑧 and learn successor measures for the exploration policy only:

min_{𝜒,𝜙} E_{(𝑠_𝑡, 𝑠_{𝑡+1})∼𝒟, 𝑠′∼𝒟}[(𝜒(𝑠_𝑡)^⊤𝜙(𝑠′) − 𝛾χ̄(𝑠_{𝑡+1})^⊤φ̄(𝑠′))^2] − 2 E_{(𝑠_𝑡, 𝑠_{𝑡+1})∼𝒟}[𝜒(𝑠_𝑡)^⊤𝜙(𝑠_{𝑡+1})]

with χ̄ and φ̄ target versions of 𝜒 and 𝜙. We normalize 𝜙 so that E_𝜌[𝜙𝜙^⊤] ≈ Id, with the same loss as for 𝐵. Of all the SF variants tested, this is the closest to FB.
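Among the feature-learning criteria above, the Laplacian objective (14) turns out to be the strongest empirically (Section 6); a minimal numpy sketch on hypothetical feature batches (the function name `laplacian_loss` is ours):

```python
import numpy as np

def laplacian_loss(phi_t, phi_next, phi_s, phi_sp, lam):
    """Spectral graph drawing objective (14): pull the features of consecutive
    states (s_t, s_{t+1}) together, with an orthonormality penalty estimated
    from two independent batches s, s' ~ rho. All inputs are (batch, d)."""
    attract = np.mean(np.sum((phi_t - phi_next) ** 2, axis=1))
    dots = phi_s @ phi_sp.T                       # all pairwise phi(s)^T phi(s')
    ortho = (np.mean(dots ** 2)
             - np.mean(np.sum(phi_s ** 2, axis=1))
             - np.mean(np.sum(phi_sp ** 2, axis=1)))
    return float(attract + lam * ortho)

# With phi_next = phi_t the attraction term vanishes; with lam = 0 the loss is 0.
rng = np.random.default_rng(0)
phi = rng.standard_normal((32, 8))
assert np.isclose(laplacian_loss(phi, phi, phi, phi, lam=0.0), 0.0)
```

The penalty term is what prevents the trivial constant solution: it rewards large per-sample norms while punishing correlations between independent states, driving E_𝜌[𝜙𝜙^⊤] towards Id.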

6. EXPERIMENTAL RESULTS ON BENCHMARKS

Each of the 11 methods (FB and 10 SF-based models) has been tested on 13 tasks in 4 environments from the Unsupervised RL and ExORL benchmarks (Laskin et al., 2021; Yarats et al., 2021): Maze (reach 20 goals), Walker (stand, walk, run, flip), Cheetah (walk, run, walk backwards, run backwards), and Quadruped (stand, walk, run, jump); see Appendix G.1. Each task and method was repeated for 3 choices of replay buffer from ExORL: RND, APS, and Proto (except for the APS method, which can only train on the APS buffer). Each of these 403 settings was repeated with 10 random seeds. The full setup is described in Appendix G. The representation dimension is 𝑑 = 50, except for Maze (𝑑 = 100). After model training, tasks are revealed by 10,000 reward samples as in Liu & Abbeel (2021), except for Maze, where a known goal is presented and used to set 𝑧_𝑟 directly. The code can be found at https://github.com/facebookresearch/controllable_agent. As toplines, we use online TD3 (with task rewards, and free environment interactions not restricted to a replay buffer), and offline TD3 (restricted to each replay buffer labelled with task rewards). Offline TD3 gives an idea of the best achievable performance given the training data in a buffer. In Fig. 2 we plot the performance of each method for each task in each environment, averaged over the three replay buffers and ten random seeds. Appendix H contains the full results and more plots. Compared to offline TD3 as a reference: on the Maze tasks, Lap, LRA-P, and FB perform well; on the Walker and Cheetah tasks, ICM, Trans, Lap, LRA-SR, and FB perform well; on the Quadruped tasks, many methods perform well, including, surprisingly, random features. Appendix I plots some of the learned features on Maze. On Maze, none of the encoder-based losses learn good policies, contrary to FB and the spectral SF methods.
We believe this is because (𝑥, 𝑦) is already a good representation of the 2D state 𝑠, so the encoders do nothing. Yet SFs on (𝑥, 𝑦) cannot solve the task (SFs on rewards of the form 𝑎𝑥 + 𝑏𝑦 do not recover goal-oriented tasks): planning with SFs requires specific representations. Fig. 1 reports aggregated scores over all tasks. To average across tasks, we normalize scores: for each task and replay buffer, performance is expressed as a percentage of the performance of offline TD3 on the same replay buffer, a natural supervised topline given the data. These normalized scores are averaged over all tasks in each environment, then over environments, to yield the scores in Fig. 1. The variations over environments, replay buffers and random seeds are reported in Appendix H. These results are broadly consistent over replay buffers. Buffer-specific results are reported in Appendices H.3-H.4. Sometimes a replay buffer is restrictive, as attested by poor offline TD3 performance, most starkly for the Proto buffer on all Quadruped tasks. APS does not work well as a zero-shot RL method, but it does work well as an exploration method: on average, the APS and RND replay buffers yield close results. FB and Lap are the only methods that perform consistently well, both over tasks and over replay buffers. Averaged over all tasks and buffers, FB reaches 81% of supervised offline TD3 performance, and 85% on the RND buffer. The second-best method is Lap with 74% (78% on the Proto buffer).
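The normalization and averaging scheme above can be sketched as follows (the function name `aggregate_score` and the nested-dict layout are ours, for illustration only; the real numbers come from the full benchmark runs):

```python
import numpy as np

def aggregate_score(scores, toplines):
    """Aggregation of the kind used for Fig. 1: normalize each raw score by
    the offline-TD3 topline for the same task and buffer, average over tasks
    within each environment, then over environments. Both arguments are
    hypothetical nested dicts: env -> task -> buffer -> raw return."""
    env_means = []
    for env, tasks in scores.items():
        task_means = [np.mean([100.0 * v / toplines[env][task][buf]
                               for buf, v in buffers.items()])
                      for task, buffers in tasks.items()]
        env_means.append(np.mean(task_means))
    return float(np.mean(env_means))

# Tiny made-up example: 50% of the topline on one task, 100% on another.
scores = {"walker": {"run": {"rnd": 200.0}, "stand": {"rnd": 900.0}}}
toplines = {"walker": {"run": {"rnd": 400.0}, "stand": {"rnd": 900.0}}}
assert aggregate_score(scores, toplines) == 75.0
```

Normalizing per task and per buffer before averaging prevents high-reward tasks, or buffers on which offline TD3 itself is weak, from dominating the aggregate.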

7. DISCUSSION AND LIMITATIONS

Can a few features solve many reward functions? Why are some features better? The choice of features is critical. Consider goal-reaching tasks: the obvious way to learn to reach arbitrary goal states via SFs is to define one feature per possible goal state (one-hot encoding). This requires |𝑆| features, and does not scale to continuous spaces. But much better features exist. For instance, on a size-𝑛 cycle 𝑆 = {0, . . . , 𝑛−1} with actions that move left and right modulo 𝑛, two features suffice instead of 𝑛: SFs with 𝜙(𝑠) = (cos(2𝜋𝑠/𝑛), sin(2𝜋𝑠/𝑛)) provide exact optimal policies to reach any state. On a 𝑑-dimensional grid 𝑆 = {0, . . . , 𝑛−1}^𝑑, just 2𝑑 features (a sine and cosine in each direction) are sufficient to reach any of the 𝑛^𝑑 goal states via SFs. Goal-reaching is only a subset of possible RL tasks, but this clearly shows that some features are better than others. The sine and cosine are the main eigenfunctions of the graph Laplacian on the grid: such features have long been argued to play a special role in RL (Mahadevan & Maggioni, 2007; Machado et al., 2017). FB-based methods are theoretically known to learn such eigenfunctions (Blier et al., 2021). Yet a precise theoretical link to downstream performance is still lacking.

Are these finite-rank models reasonable? FB crucially relies on a finite-rank model of ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡, while some SF variants above rely on finite-rank models of 𝑃 or the corresponding Laplacian. It turns out such approximations are very different for 𝑃 and for ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡. Unfortunately, despite the popularity of low-rank 𝑃 assumptions in the theoretical literature, 𝑃_𝜋 is always close to Id in situations where 𝑠_{𝑡+1} is close to 𝑠_𝑡, such as any continuous-time physical system (Appendix D). Any low-rank model of 𝑃_𝜋 will be poor; actually 𝑃_𝜋 is better modeled as Id + low-rank, thus approximating the Laplacian.
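The cycle example can be checked numerically: linearly regressing a goal-reaching reward onto the two features (the test-time inference step of SFs) already singles out the goal. This is a minimal NumPy sketch under our own naming:

```python
import numpy as np

n = 12  # cycle size
s = np.arange(n)
# Two features per state: phi(s) = (cos(2*pi*s/n), sin(2*pi*s/n))
Phi = np.stack([np.cos(2 * np.pi * s / n), np.sin(2 * np.pi * s / n)], axis=1)

for goal in range(n):
    r = (s == goal).astype(float)  # goal-reaching reward 1_{s = goal}
    # Linear regression of the reward on the features gives z_r.
    z_r = np.linalg.lstsq(Phi, r, rcond=None)[0]
    # The projected reward phi(s)^T z_r is proportional to cos(2*pi*(s-goal)/n),
    # so it is maximized exactly at the goal state.
    assert np.argmax(Phi @ z_r) == goal
```

So two Laplacian features are enough to encode any of the 𝑛 goals after a simple linear regression, whereas a one-hot encoding would need 𝑛 features.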
On the other hand, 𝑃_𝜋^𝑡 with large 𝑡 gets close to rank one under weak assumptions (ergodicity), as 𝑃_𝜋^𝑡 converges to an equilibrium distribution when 𝑡 → ∞. Thus the spectrum of the successor measures ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡 is usually more spread out (details and examples in Appendix D), and a low-rank model makes sense. The eigenvalues of 𝑃_𝜋 are close to 1 and there is little signal to differentiate between eigenvectors, but the differences become clearer over time in 𝑃_𝜋^𝑡. This may explain the better performance of FB and LRA-SR compared to LRA-P.

Limitations. First, these experiments are still small-scale and performance is not perfect, so there is room for improvement. Also, all the environments tested here are deterministic: all the algorithms still make sense in stochastic environments, but the experimental conclusions may differ. Second, even though these methods can be coupled with any exploration technique, they will obviously not cover tasks too different from the actions in the replay buffer, as with any offline RL method.

These zero-shot RL algorithms learn to summarize the long-term future for a wide range of policies (though without synthesizing trajectories). This is a lot: in contrast, world models only learn policy-independent one-step transitions. So the question remains of how far this can scale. A priori, there is no restriction on the inputs (e.g., images, state history...). Still, for large problems, some form of prior seems unavoidable. For SFs, priors can be integrated into 𝜙, but rewards must be linear in 𝜙. For FB, priors can be integrated into 𝐵's input. Touati & Ollivier (2021) use FB with pixel-based inputs for 𝐹, but only the agent's (𝑥, 𝑦) position as 𝐵's input: this recovers all rewards that are functions of (𝑥, 𝑦) (linear or not). Breaking the symmetry of 𝐹 and 𝐵 reduces the strain on the model by restricting predictions to fewer variables, such as an agent's future state instead of the full environment.

8. CONCLUSIONS

Zero-shot RL methods avoid test-time planning by summarizing long-term state-state relationships for particular policies. We systematically tested forward-backward representations and many new models of successor features on zero-shot tasks. We also uncovered algebraic links between SFs, FB, contrastive learning, and spectral methods. Overall, SFs suffer from their dependence on the choice of basic features, with only Laplacian eigenfunctions working reliably. Notably, planning with SFs requires specific features: SFs can fail with generic encoder-type feature learning, even if the learned representation of states is reasonable (such as (𝑥, 𝑦)). Forward-backward representations were the best across the board, and provide reasonable zero-shot RL performance.

A SKETCH OF PROOF OF THEOREMS 1 AND 2

To provide an intuition behind Theorems 1 and 2 on SFs and FB, we include here a sketch of proof in the finite state case. Full proofs in the general case can be found in the appendix of Touati & Ollivier (2021).

For Theorem 1 (SFs), let us assume that the reward is linear in the features 𝜙, namely, 𝑟(𝑠) = 𝜙(𝑠)^⊤𝑤 for some 𝑤 ∈ ℝ^𝑑. Then 𝑧_𝑟 = 𝑤, since 𝑧_𝑟 is the linear regression of the reward on the features (assuming the features are linearly independent). Using the definition (4) of the successor features 𝜓, and taking the dot product with 𝑧_𝑟, we obtain

    𝜓(𝑠_0, 𝑎_0, 𝑧_𝑟)^⊤𝑧_𝑟 = E[ ∑_𝑡 𝛾^𝑡 𝜙(𝑠_{𝑡+1})^⊤𝑧_𝑟 | 𝑠_0, 𝑎_0, 𝜋_{𝑧_𝑟} ] = E[ ∑_𝑡 𝛾^𝑡 𝑟(𝑠_{𝑡+1}) | 𝑠_0, 𝑎_0, 𝜋_{𝑧_𝑟} ]

since 𝑟 = 𝜙^⊤𝑤. This means that 𝜓(𝑠_0, 𝑎_0, 𝑧_𝑟)^⊤𝑧_𝑟 is the 𝑄-function of reward 𝑟 for policy 𝜋_{𝑧_𝑟}. At the same time, by the definition (4), 𝜋_{𝑧_𝑟} is defined as the argmax of 𝜓(𝑠, 𝑎, 𝑧_𝑟)^⊤𝑧_𝑟. Therefore, the policy 𝜋_{𝑧_𝑟} is the argmax of its own 𝑄-function, meaning it is the optimal policy for reward 𝑟.

For Theorem 2 (FB), let us assume that FB perfectly satisfies the training criterion (5), namely, 𝑀^{𝜋_𝑧} = 𝐹_𝑧^⊤𝐵 diag(𝜌) in matrix form. Thanks to the definition (1) of successor representations 𝑀^𝜋, for any policy 𝜋_𝑧, the 𝑄-function for the reward 𝑟 can be written in matrix form as 𝑄_𝑟^{𝜋_𝑧} = 𝑀^{𝜋_𝑧}𝑟 = 𝐹_𝑧^⊤𝐵 diag(𝜌) 𝑟. Thus, if we define 𝑧_𝑟 := 𝐵 diag(𝜌) 𝑟 = E_{𝑠∼𝜌}[𝐵(𝑠)𝑟(𝑠)], we obtain 𝑄_𝑟^{𝜋_𝑧} = 𝐹_𝑧^⊤𝑧_𝑟 for any 𝑧 ∈ ℝ^𝑑. In particular, the latter holds for 𝑧 = 𝑧_𝑟 as well: 𝐹_{𝑧_𝑟}^⊤𝑧_𝑟 is the 𝑄-function of 𝜋_{𝑧_𝑟}. Again, the policies 𝜋_𝑧 are defined in (5) as the greedy policies of 𝐹_𝑧^⊤𝑧, for any 𝑧. Therefore, 𝜋_{𝑧_𝑟} is the argmax of its own 𝑄-function. Hence, 𝜋_{𝑧_𝑟} is the optimal policy for the reward 𝑟.
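In the finite case, the objects above can be checked numerically. The sketch below (our own construction, using the 𝑠_{𝑡+1}-based convention for successor representations as in (1)) builds a random MDP, computes 𝑀^𝜋 = 𝑃(Id − 𝛾𝑃_𝜋)^{−1} in closed form, and verifies that 𝑀^𝜋𝑟 agrees with iterative policy evaluation of 𝑄^𝜋_𝑟:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9

# Random transition kernel P[(s,a), s'] and a fixed deterministic policy pi.
P = rng.random((S * A, S))
P /= P.sum(axis=1, keepdims=True)
pi = rng.integers(A, size=S)
# P_pi[s, s'] = P[(s, pi(s)), s']: state-to-state transitions under pi.
P_pi = P[np.arange(S) * A + pi]

# Successor representation M^pi = sum_t gamma^t P P_pi^t = P (I - gamma P_pi)^{-1}.
M = P @ np.linalg.inv(np.eye(S) - gamma * P_pi)

# Check: Q^pi_r = M r matches iterative policy evaluation of
# Q(s,a) = E[r(s') + gamma Q(s', pi(s'))], with rewards on s_{t+1}.
r = rng.random(S)
Q = np.zeros(S * A)
for _ in range(2000):
    V = Q[np.arange(S) * A + pi]
    Q = P @ (r + gamma * V)
assert np.allclose(M @ r, Q, atol=1e-6)
```

The test-time FB inference 𝑧_𝑟 = E_{𝑠∼𝜌}[𝐵(𝑠)𝑟(𝑠)] then amounts to a weighted average of 𝐵 over reward samples, with no planning against 𝑀 needed.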

B DERIVATION OF THE FORWARD-BACKWARD LOSS

Here we quickly derive the loss (8) used to train 𝐹 and 𝐵 such that 𝑀^𝜋(𝑠, 𝑎, d𝑠′) ≈ 𝐹(𝑠, 𝑎)^⊤𝐵(𝑠′) 𝜌(d𝑠′). Training is based on the Bellman equation satisfied by 𝑀^𝜋. This holds separately for each policy parameter 𝑧, so in this section we omit 𝑧 for simplicity. Here 𝜌(d𝑠′) is the distribution of states in the dataset. Importantly, the resulting loss does not require knowing this measure 𝜌(d𝑠′), only being able to sample states 𝑠 ∼ 𝜌 from the dataset.

The successor measure 𝑀^𝜋 satisfies a Bellman-like equation 𝑀^𝜋 = 𝑃 + 𝛾𝑃_𝜋𝑀^𝜋, as matrices in the finite case and as measures in the general case (Blier et al., 2021). We can learn FB by iteratively minimizing the Bellman residual 𝑀^𝜋 − (𝑃 + 𝛾𝑃_𝜋𝑀^𝜋) on the parametric model 𝑀 = 𝐹^⊤𝐵𝜌.

𝑀^𝜋(𝑠, 𝑎, d𝑠′) is a measure on 𝑠′ for each (𝑠, 𝑎), so it is not obvious how to measure the size of the Bellman residual. In general, we can define a norm on such objects 𝑀 by taking the density with respect to the reference measure 𝜌,

    ‖𝑀‖²_𝜌 := E_{(𝑠,𝑎)∼𝜌, 𝑠′∼𝜌} [ ( 𝑀(𝑠, 𝑎, d𝑠′) / 𝜌(d𝑠′) )² ]

where 𝑀(𝑠, 𝑎, d𝑠′)/𝜌(d𝑠′) is the density of 𝑀 with respect to 𝜌. For finite states, 𝑀 is a matrix 𝑀_{𝑠𝑎𝑠′} and this is just a 𝜌-weighted Frobenius matrix norm, ‖𝑀‖²_𝜌 = ∑_{𝑠𝑎𝑠′} 𝑀²_{𝑠𝑎𝑠′} 𝜌(𝑠, 𝑎)/𝜌(𝑠′). (This is also how we proceed to learn a low-rank approximation of 𝑃 in (16).)

We define the loss on 𝐹 and 𝐵 as the norm of the Bellman residual for the model 𝑀 = 𝐹^⊤𝐵𝜌. As usual in temporal difference learning, we use fixed, non-trainable target networks 𝐹̄ and 𝐵̄ for the right-hand side of the Bellman equation. Thus, the Bellman residual is 𝐹^⊤𝐵𝜌 − (𝑃 + 𝛾𝑃_𝜋𝐹̄^⊤𝐵̄𝜌), and the loss is

    ℒ(𝐹, 𝐵) := ‖ 𝐹^⊤𝐵𝜌 − (𝑃 + 𝛾𝑃_𝜋𝐹̄^⊤𝐵̄𝜌) ‖²_𝜌.   (22)

When computing the norm ‖·‖_𝜌, the denominator 𝜌 cancels out with the 𝐹^⊤𝐵𝜌 terms, but we are left with a 𝑃/𝜌 term.
This term can still be integrated, because integrating 𝑃(d𝑠′|𝑠, 𝑎)/𝜌(d𝑠′) under 𝑠′ ∼ 𝜌 is equivalent to integrating directly under 𝑠′ ∼ 𝑃(d𝑠′|𝑠, 𝑎), namely, under transitions (𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}) in the environment. This plays out as follows:

    ℒ(𝐹, 𝐵) = E_{(𝑠_𝑡,𝑎_𝑡)∼𝜌, 𝑠′∼𝜌} [ ( 𝐹(𝑠_𝑡, 𝑎_𝑡)^⊤𝐵(𝑠′) − 𝑃(d𝑠′|𝑠_𝑡, 𝑎_𝑡)/𝜌(d𝑠′) − 𝛾 E_{𝑠_{𝑡+1}∼𝑃(d𝑠_{𝑡+1}|𝑠_𝑡,𝑎_𝑡)}[ 𝐹̄(𝑠_{𝑡+1}, 𝜋(𝑠_{𝑡+1}))^⊤𝐵̄(𝑠′) ] )² ]   (23)
    = E_{(𝑠_𝑡,𝑎_𝑡,𝑠_{𝑡+1})∼𝜌, 𝑠′∼𝜌} [ ( 𝐹(𝑠_𝑡, 𝑎_𝑡)^⊤𝐵(𝑠′) − 𝛾 𝐹̄(𝑠_{𝑡+1}, 𝜋(𝑠_{𝑡+1}))^⊤𝐵̄(𝑠′) )² ] − 2 E_{(𝑠_𝑡,𝑎_𝑡,𝑠_{𝑡+1})∼𝜌} [ 𝐹(𝑠_𝑡, 𝑎_𝑡)^⊤𝐵(𝑠_{𝑡+1}) ] + Const   (24)

where Const is a constant term that does not depend on 𝐹 and 𝐵 and can be discarded. Apart from this discarded term, all terms in the final expression can be sampled from the dataset. Note that we have a −2𝐹(𝑠_𝑡, 𝑎_𝑡)^⊤𝐵(𝑠_{𝑡+1}) term where Touati & Ollivier (2021) have a −2𝐹(𝑠_𝑡, 𝑎_𝑡)^⊤𝐵(𝑠_𝑡) term: this is because we define successor representations (1) using 𝑠_{𝑡+1} while Touati & Ollivier (2021) use 𝑠_𝑡. See also Appendix L for pseudocode (including the orthonormalization loss, and double networks for 𝑄-learning as described in Appendix G).

The orthonormalization loss. An auxiliary loss is used to normalize 𝐵 so that E_{𝑠∼𝜌}[𝐵(𝑠)𝐵(𝑠)^⊤] ≈ Id. This loss is

    ℒ_norm(𝐵) := ‖ E_𝜌[𝐵𝐵^⊤] − Id ‖²_Frobenius   (25)
    = E_{𝑠∼𝜌, 𝑠′∼𝜌} [ (𝐵(𝑠)^⊤𝐵(𝑠′))² − ‖𝐵(𝑠)‖²_2 − ‖𝐵(𝑠′)‖²_2 ] + Const.

(The more complex expression in Touati & Ollivier (2021) has the same gradients up to a factor 4.)

The auxiliary loss ℒ′ (9). Learning 𝐹(𝑠, 𝑎, 𝑧) is equivalent to learning 𝐹(𝑠, 𝑎, 𝑧)^⊤𝑧′ for all vectors 𝑧′. Yet the definition of the policies 𝜋_𝑧 in FB only uses 𝐹(𝑠, 𝑎, 𝑧)^⊤𝑧. Thus, as was done for SFs in Section 5.1, one may wonder if there is a scalar rather than vector loss to train 𝐹, that would reduce the error in the directions of 𝐹 used to define 𝜋_𝑧. In the case of FB, the full vector loss is needed to train 𝐹 and 𝐵.
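As a concrete illustration, the sampled version of the orthonormalization loss (25) is only a few lines; below is a value-level NumPy sketch under our own naming, dropping the constant term (which has zero gradient):

```python
import numpy as np

def orthonormalization_loss(B1, B2):
    """Sampled version of L_norm(B), up to a dropped constant.

    B1, B2: [batch, d] embeddings B(s), B(s') of two independent state
    batches drawn from the dataset distribution rho.
    """
    dots = B1 @ B2.T  # all pairwise B(s)^T B(s')
    return ((dots ** 2).mean()
            - (B1 ** 2).sum(-1).mean()
            - (B2 ** 2).sum(-1).mean())

# In expectation this equals ||E_rho[B B^T] - Id||^2_F minus the constant d,
# so it is minimized when the features are orthonormal in expectation.
rng = np.random.default_rng(0)
loss = orthonormalization_loss(rng.standard_normal((256, 16)),
                               rng.standard_normal((256, 16)))
```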
However, one can add the following auxiliary loss on 𝐹 to reduce the errors in the specific direction 𝐹(𝑠, 𝑎, 𝑧)^⊤𝑧. This is obtained as follows. Take the Bellman gap in the main FB loss (22): this Bellman gap is 𝐹^⊤𝐵𝜌 − (𝑃 + 𝛾𝑃_𝜋𝐹̄^⊤𝐵̄𝜌). To specialize this Bellman gap in the direction of 𝐹(𝑠, 𝑎, 𝑧)^⊤𝑧, multiply by 𝐵^⊤(𝐵𝜌𝐵^⊤)^{−1}𝑧 on the right: this yields 𝐹^⊤𝑧 − (𝑃𝐵^⊤(𝐵𝜌𝐵^⊤)^{−1}𝑧 + 𝛾𝑃_𝜋𝐹̄^⊤𝑧) (using that we compute the loss at 𝐵̄ = 𝐵). This new loss is the Bellman gap on 𝐹^⊤𝑧, with reward 𝑃𝐵^⊤(𝐵𝜌𝐵^⊤)^{−1}𝑧. This is the loss ℒ′ described in (9). It is a particular case of the main FB loss: ℒ = 0 implies ℒ′ = 0, since ℒ′ was obtained by multiplying the Bellman gap of ℒ. We use ℒ′ on top of the main FB loss to reduce errors in the direction 𝐹(𝑠, 𝑎, 𝑧)^⊤𝑧. In the end, however, the differences are modest.

C RELATIONSHIP BETWEEN CONTRASTIVE LOSS AND SVD OF SUCCESSOR MEASURES

Here we prove the precise relationship between the contrastive loss (17) and SVDs of the successor measure of the exploration policy. Intuitively, both methods push states together if they lie on the same trajectory, by increasing the dot product between the representations of 𝑠_𝑡 and 𝑠_{𝑡+𝑘}. This is formalized as follows.

Let 𝜋 be the policy used to produce the trajectories in the dataset. Define

    𝑀̄ := (1 − 𝛾_CL) ∑_{𝑡≥0} 𝛾_CL^𝑡 𝑃_𝜋^{𝑡+1}

where 𝛾_CL is the parameter of the geometric distribution used to choose 𝑘 when sampling 𝑠_𝑡 and 𝑠_{𝑡+𝑘}. 𝑀̄ is a stochastic matrix in the discrete case, and a probability measure over 𝑆 in the general case: it is the normalized version of the successor measure (3) with 𝜋 the exploration policy. By construction, the distribution of 𝑠_{𝑡+𝑘} knowing 𝑠_𝑡, with 𝑘 ∼ Geom(1 − 𝛾_CL), is described by 𝑀̄. Therefore, we can rewrite the loss as

    − E_{𝑘∼Geom(1−𝛾_CL), (𝑠_𝑡,𝑠_{𝑡+𝑘})∼𝒟} [ log ( exp(cosine(𝜙(𝑠_𝑡), 𝜇(𝑠_{𝑡+𝑘}))) / E_{𝑠′∼𝒟} exp(cosine(𝜙(𝑠_𝑡), 𝜇(𝑠′))) ) ]   (28)
    = − E_{𝑠∼𝜌, 𝑠′∼𝜌} [ (𝑀̄(𝑠, d𝑠′)/𝜌(d𝑠′)) cosine(𝜙(𝑠), 𝜇(𝑠′)) ] + E_{𝑠∼𝜌} [ log E_{𝑠′∼𝒟} exp(cosine(𝜙(𝑠), 𝜇(𝑠′))) ].

Assume that 𝜙 and 𝜇 are centered with unit norm, namely, ‖𝜙(𝑠)‖_2 = 1 and E_{𝑠∼𝜌}𝜙(𝑠) = 0, and likewise for 𝜇. With unit norm, the cosine becomes just a dot product, and the loss is

    ⋯ = − E_{𝑠∼𝜌, 𝑠′∼𝜌} [ (𝑀̄(𝑠, d𝑠′)/𝜌(d𝑠′)) 𝜙(𝑠)^⊤𝜇(𝑠′) ] + E_{𝑠∼𝜌} [ log E_{𝑠′∼𝒟} exp(𝜙(𝑠)^⊤𝜇(𝑠′)) ].

A second-order Taylor expansion provides log E exp 𝑋 = E 𝑋 + ½ E[𝑋²] − ½ (E 𝑋)² + 𝑂(|𝑋|³), and therefore, with E 𝜙 = E 𝜇 = 0, the loss is approximately

    ⋯ ≈ − E_{𝑠∼𝜌, 𝑠′∼𝜌} [ (𝑀̄(𝑠, d𝑠′)/𝜌(d𝑠′)) 𝜙(𝑠)^⊤𝜇(𝑠′) ] + ½ E_{𝑠∼𝜌, 𝑠′∼𝜌} [ (𝜙(𝑠)^⊤𝜇(𝑠′))² ]   (32)
    = ½ E_{𝑠∼𝜌, 𝑠′∼𝜌} [ ( 𝜙(𝑠)^⊤𝜇(𝑠′) − 𝑀̄(𝑠, d𝑠′)/𝜌(d𝑠′) )² ] + Const   (33)

where the constant term does not depend on 𝜙 and 𝜇. This is minimized when 𝜙^⊤𝜇 is the SVD of 𝑀̄/𝜌 in the 𝐿²(𝜌) norm.
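The sampling scheme behind (28) is easy to sketch: draw anchors 𝑠_𝑡, positives 𝑠_{𝑡+𝑘} with geometric offsets 𝑘, and negatives from the dataset, then score pairs by cosine similarity. A value-level NumPy sketch with our own names and fake trajectory embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_cl, T, d = 0.9, 500, 8

# Fake trajectory of embeddings phi(s_t), mu(s_t) standing in for the networks.
phi = rng.standard_normal((T, d))
mu = rng.standard_normal((T, d))
cos = lambda a, b: (a * b).sum(-1) / (np.linalg.norm(a, axis=-1)
                                      * np.linalg.norm(b, axis=-1))

# Anchors s_t and positives s_{t+k} with k ~ Geom(1 - gamma_cl).
batch = 64
t = rng.integers(0, T // 2, size=batch)
k = rng.geometric(1 - gamma_cl, size=batch)
pos = np.minimum(t + k, T - 1)
neg = rng.integers(0, T, size=(batch, 128))  # negatives s' ~ D

# InfoNCE-style loss:
# -log [ exp(cos(phi(s_t), mu(s_{t+k}))) / E_{s'} exp(cos(phi(s_t), mu(s'))) ]
pos_sim = cos(phi[t], mu[pos])
neg_sim = cos(phi[t][:, None, :], mu[neg])   # [batch, 128]
loss = -(pos_sim - np.log(np.exp(neg_sim).mean(axis=1))).mean()
```

With 𝑘 geometrically distributed, the positive pairs are exactly distributed according to 𝑀̄, which is what links this loss to the successor measure above.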

D DO FINITE-RANK MODELS ON THE TRANSITION MATRIX AND ON SUCCESSOR MEASURES MAKE SENSE?

It turns out finite-rank models are very different for 𝑃 and for ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡. In typical situations, the spectrum of 𝑃 is concentrated around 1, while that of ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡 is much more spread out.

Despite the popularity of low-rank 𝑃 in theoretical RL works, 𝑃 is never close to low-rank in continuous-time systems: there, 𝑃 is actually always close to the identity. Generally speaking, 𝑃 cannot be low-rank if most actions have a small effect. By definition, for any feature function 𝜙, (𝑃_𝜋𝜙)(𝑠) = E[𝜙(𝑠_{𝑡+1}) | 𝑠_𝑡 = 𝑠]. Intuitively, if actions have a small effect, then 𝑠_{𝑡+1} is close to 𝑠_𝑡, and 𝜙(𝑠_{𝑡+1}) ≈ 𝜙(𝑠_𝑡) for continuous 𝜙. This means that 𝑃_𝜋𝜙 is close to 𝜙, so that 𝑃_𝜋 is close to the identity on a large subspace of feature functions 𝜙. In the theory of continuous-time Markov processes, the time-𝑡 transition kernel is given by 𝑃_𝑡 = 𝑒^{𝑡𝐴} with 𝐴 the infinitesimal generator of the process (Levin et al., 2009, §20.1; Øksendal, 1998, §8.1), hence 𝑃_𝑡 is Id + 𝑂(𝑡) for small timesteps 𝑡. In general, the transition matrix 𝑃 is better modeled as Id + low-rank, which corresponds to a low-rank model of the Markov chain Laplacian Id − 𝑃.

On the other hand, though ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡 is never exactly low-rank (it is invertible), it has meaningful low-rank approximations under weak assumptions. For large 𝑡, 𝑃_𝜋^𝑡 becomes close to rank one under weak assumptions (ergodicity), as it converges to the equilibrium distribution of the transition kernel. For large 𝛾, the sum ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡 is dominated by large 𝑡. Most eigenvalues of 𝑃_𝜋 are close to 1, but taking powers 𝑃_𝜋^𝑡 sharpens the differences between eigenvalues: with 𝛾 close to 1, going from 𝑃_𝜋 to ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡 = (Id − 𝛾𝑃_𝜋)^{−1} changes an eigenvalue 1 − 𝜀 into approximately 1/𝜀. In short, on 𝑃 itself there is little learning signal to differentiate between eigenvectors, but the differences become visible over time. This may explain why FB works better than low-rank decompositions directly based on 𝑃 or the Laplacian.
For instance, consider the nearest-neighbor random walk on a length-𝑛 cycle {0, 1, . . . , 𝑛−1} with arithmetic modulo 𝑛, namely, moving in dimension 1. (This extends to grids of any dimension.) The associated 𝑃_𝜋 is not low-rank in any reasonable sense: the corresponding stochastic matrix is concentrated around the diagonal, and many eigenvalues are close to 1. Precisely, the eigenvalues are cos(2𝑘𝜋/𝑛) with integer 𝑘 ∈ {0, . . . , 𝑛/2}, which is ≈ 1 − 2𝜋²(𝑘/𝑛)² when 𝑘 ≪ 𝑛. Half of the eigenvalues are between √2/2 and 1 in absolute value. However, when 𝛾 → 1, ∑_𝑡 𝛾^𝑡 𝑃_𝜋^𝑡 has one eigenvalue 1/(1 − 𝛾) and the other eigenvalues tend to 1/(1 − cos(2𝑘𝜋/𝑛)) ≈ 𝑛²/(2𝜋²𝑘²) with positive integer 𝑘: there is one large eigenvalue, then the others decrease like cst/𝑘². With such a spread-out spectrum, a finite-rank model makes sense.
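This spectral contrast can be verified numerically. The sketch below (our own illustration) builds the cycle walk's transition matrix and compares the spectral decay of 𝑃_𝜋 with that of (Id − 𝛾𝑃_𝜋)^{−1}:

```python
import numpy as np

n, gamma = 20, 0.99
# Nearest-neighbor random walk on a length-n cycle: step left or right with prob 1/2.
P = np.zeros((n, n))
for s in range(n):
    P[s, (s - 1) % n] = P[s, (s + 1) % n] = 0.5

# Eigenvalues of P cluster near 1 (they are cos(2*pi*k/n)) ...
eig_P = np.sort(np.linalg.eigvalsh(P))[::-1]
# ... while the resolvent sum_t gamma^t P^t = (I - gamma P)^{-1} has one dominant
# eigenvalue 1/(1-gamma) and a fast-decaying tail ~ n^2 / (2 pi^2 k^2).
eig_M = np.sort(np.linalg.eigvalsh(np.linalg.inv(np.eye(n) - gamma * P)))[::-1]

assert np.isclose(eig_M[0], 1 / (1 - gamma))
# The spectrum of the resolvent is far more spread out than that of P:
assert eig_M[0] / eig_M[1] > eig_P[0] / eig_P[1]
```

With 𝑛 = 20 and 𝛾 = 0.99, the top two eigenvalues of 𝑃 differ by about 5%, while those of the resolvent differ by a factor of several: a rank-few truncation discards far less of the resolvent than of 𝑃.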

E WHICH FEATURES ARE ANALOGOUS BETWEEN LOW-RANK 𝑃 AND SFS?

Here we explain the relationship between a low-rank model of 𝑃 and successor features. More precisely, if transition probabilities from (𝑠, 𝑎) to 𝑠′ can be written exactly as 𝜒(𝑠, 𝑎)^⊤𝜇(𝑠′), then SFs with basic features 𝜙 := 𝜇 will provide optimal policies for any reward function (Theorem 3).

Indeed, under the finite-rank model 𝑃(d𝑠′|𝑠, 𝑎) = 𝜒(𝑠, 𝑎)^⊤𝜇(𝑠′)𝜌(d𝑠′), rewards only matter via the reward features E_{𝑠′∼𝜌}[𝜇(𝑠′)𝑟(𝑠′)]: namely, two rewards with the same reward features have the same 𝑄-function, as the dynamics produce the same expected rewards. Then 𝑄-functions are linear in these reward features, and using successor features with 𝜙 := 𝜇 provides the correct 𝑄-functions, as follows.

Theorem 3. Assume that 𝑃(d𝑠′|𝑠, 𝑎) = 𝜒(𝑠, 𝑎)^⊤𝜇(𝑠′)𝜌(d𝑠′). Then successor features using the basic features 𝜙 := 𝜇 provide optimal policies for any reward function.

This is why we use 𝜇 rather than 𝜒 for the SF basic features. This is also why we avoided the traditional notation 𝑃(𝑠′|𝑠, 𝑎) = 𝜙(𝑠, 𝑎)^⊤𝜇(𝑠′) often used for low-rank 𝑃, which induces a conflict of notation with the 𝜙 in SFs, and suggests the wrong analogy. Meanwhile, 𝜒 plays a role more analogous to SFs' 𝜓, although for one-step transitions instead of multistep transitions as in SFs: 𝑄-functions are linear combinations of the features 𝜒. In particular, the optimal 𝑄-function for reward 𝑟 is 𝑄*_𝑟 = 𝜒^⊤𝑤_𝑟 for some 𝑤_𝑟. But contrary to successor features, there is no simple correspondence to compute 𝑤_𝑟 from 𝑟.

Proof. Let 𝜋 be any policy. On a finite space in matrix notation, and omitting 𝜌 for simplicity, the assumption 𝑃 = 𝜒^⊤𝜇 implies 𝑃_𝜋 = 𝜒_𝜋^⊤𝜇, where 𝜒_𝜋(𝑠) := E_{𝑎∼𝜋(𝑠)}𝜒(𝑠, 𝑎) are the 𝜋-averaged features. Then

    𝑄^𝜋_𝑟 = 𝑃 ∑_{𝑡≥0} 𝛾^𝑡 𝑃_𝜋^𝑡 𝑟   (34)
    = 𝜒^⊤𝜇 ∑_{𝑡≥0} 𝛾^𝑡 (𝜒_𝜋^⊤𝜇)^𝑡 𝑟   (35)
    = 𝜒^⊤ ( ∑_{𝑡≥0} 𝛾^𝑡 (𝜇𝜒_𝜋^⊤)^𝑡 ) 𝜇𝑟.   (36)

Thus, 𝑄-functions are expressed as 𝑄^𝜋_𝑟(𝑠, 𝑎) = 𝜒(𝑠, 𝑎)^⊤𝑤(𝜋, 𝑟) with 𝑤(𝜋, 𝑟) = ( ∑_{𝑡≥0} 𝛾^𝑡 (𝜇𝜒_𝜋^⊤)^𝑡 ) 𝜇𝑟.
Moreover, rewards only matter via 𝜇𝑟: two rewards with the same 𝜇𝑟 have the same 𝑄-function for every policy. In full generality on continuous spaces, with 𝜌 again, the same holds with 𝜇(𝑠)𝜌(d𝑠) instead of 𝜇(𝑠), and E_{𝑠∼𝜌}[𝜇(𝑠)𝑟(𝑠)] instead of 𝜇𝑟.

Now, let 𝑟 be any reward function, and let 𝑟′ be its 𝐿²(𝜌)-orthogonal projection onto the space generated by the features 𝜇. By construction, 𝑟 − 𝑟′ is 𝐿²(𝜌)-orthogonal to 𝜇, namely, E_𝜌[𝜇(𝑟 − 𝑟′)] = 0. So E_𝜌[𝜇𝑟] = E_𝜌[𝜇𝑟′]. Therefore, by the above, 𝑟 and 𝑟′ have the same 𝑄-function for every policy. By definition, 𝑟′ lies in the linear span of the features 𝜇. By Theorem 1, SFs with features 𝜙 = 𝜇 provide optimal policies for 𝑟′. Since 𝑟 and 𝑟′ have the same 𝑄-function for every policy, an optimal policy for 𝑟′ is also optimal for 𝑟.

F RELATIONSHIP BETWEEN LAPLACIAN EIGENFUNCTIONS AND LOW-RANK 𝑃 LEARNING

The loss (16) used to learn a low-rank model of the transition probabilities 𝑃 is an asymmetric version of the Laplacian eigenfunction loss (14) with 𝜆 = 1. Said equivalently, if we use the low-rank 𝑃 loss (16) constrained with 𝜒 = 𝜇 to learn a low-rank model of 𝑃_𝜋 instead of 𝑃 (with 𝜋 the exploration policy), then we get the Laplacian eigenfunction loss (14) with 𝜆 = 1.

Indeed, set 𝜆 = 1 in (14). Assume that the distributions of 𝑠_𝑡 and 𝑠_{𝑡+1} in the dataset are identical on average (this happens, e.g., if the dataset is made of long trajectories or if 𝜌 is close enough to the invariant distribution of the exploration policy). Then, in the Laplacian loss (14), the norms from the first term cancel those from the second, and the Laplacian loss simplifies to

    (14) = E_{(𝑠_𝑡,𝑠_{𝑡+1})∼𝒟} [ ‖𝜙(𝑠_𝑡) − 𝜙(𝑠_{𝑡+1})‖² ] + E_{𝑠∼𝒟, 𝑠′∼𝒟} [ (𝜙(𝑠)^⊤𝜙(𝑠′))² − ‖𝜙(𝑠)‖²_2 − ‖𝜙(𝑠′)‖²_2 ]   (37)
    = E_{𝑠_𝑡∼𝒟} ‖𝜙(𝑠_𝑡)‖² + E_{𝑠_{𝑡+1}∼𝒟} ‖𝜙(𝑠_{𝑡+1})‖² − 2 E_{(𝑠_𝑡,𝑠_{𝑡+1})∼𝒟} [ 𝜙(𝑠_𝑡)^⊤𝜙(𝑠_{𝑡+1}) ] + E_{𝑠∼𝒟, 𝑠′∼𝒟} [ (𝜙(𝑠)^⊤𝜙(𝑠′))² − ‖𝜙(𝑠)‖²_2 − ‖𝜙(𝑠′)‖²_2 ]   (38-39)
    = −2 E_{(𝑠_𝑡,𝑠_{𝑡+1})∼𝒟} [ 𝜙(𝑠_𝑡)^⊤𝜙(𝑠_{𝑡+1}) ] + E_{𝑠∼𝒟, 𝑠′∼𝒟} [ (𝜙(𝑠)^⊤𝜙(𝑠′))² ].   (40)

This is the same as the low-rank loss (16) if we omit actions 𝑎 and constrain 𝜒 = 𝜇.
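A sampled version of the simplified loss (40) is short; below is a value-level NumPy sketch (our own naming), where the second term doubles as an implicit orthonormalization penalty:

```python
import numpy as np

def laplacian_loss(phi_t, phi_t1, phi_a, phi_b):
    """Sampled simplified Laplacian eigenfunction loss (lambda = 1).

    phi_t, phi_t1: [batch, d] features of consecutive states (s_t, s_{t+1}).
    phi_a, phi_b:  [batch, d] features of two independent state batches from D.
    """
    attract = -2 * (phi_t * phi_t1).sum(-1).mean()  # pull trajectory neighbors together
    repel = ((phi_a @ phi_b.T) ** 2).mean()         # keep features spread out
    return attract + repel

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16))
loss = laplacian_loss(x, x + 0.1 * rng.standard_normal(x.shape),
                      rng.standard_normal((128, 16)),
                      rng.standard_normal((128, 16)))
```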

G EXPERIMENTAL SETUP

In this section we provide additional information about our experiments. Code snippets for the main losses are given in Appendix L. The full code can be found at https://github.com/facebookresearch/controllable_agent 

G.1 ENVIRONMENTS

All the environments considered in this paper are based on the DeepMind Control Suite (Tassa et al., 2018).

• Point-mass Maze: a 2-dimensional continuous maze with four rooms. The states are 4-dimensional vectors consisting of the position and velocity of the point mass (𝑥, 𝑦, 𝑣_𝑥, 𝑣_𝑦), and the actions are 2-dimensional vectors. At test time, we assess the performance of the agents on 20 goal-reaching tasks (5 goals in each room, described by their (𝑥, 𝑦) coordinates).

• Walker: a planar walker. States are 24-dimensional vectors consisting of the positions and velocities of the robot joints, and actions are 6-dimensional vectors. We consider 4 different tasks at test time: the walker_stand reward is a combination of terms encouraging an upright torso and some minimal torso height, while the walker_walk and walker_run rewards include a component encouraging some minimal forward velocity. The walker_flip reward includes a component encouraging some minimal angular momentum.

• Cheetah: a running planar biped. States are 17-dimensional vectors consisting of the positions and velocities of the robot joints, and actions are 6-dimensional vectors. We consider 4 different tasks at test time: the cheetah_walk and cheetah_run rewards are linearly proportional to the forward velocity up to some desired value: 2 m/s for walk and 10 m/s for run. Similarly, the cheetah_walk_backward and cheetah_run_backward rewards encourage reaching some minimal backward velocities.

• Quadruped: a four-legged ant navigating in 3D space. States and actions are 78-dimensional and 12-dimensional vectors, respectively. We consider 4 tasks at test time: the quadruped_stand reward encourages an upright torso. The quadruped_walk and quadruped_run rewards include a term encouraging some minimal torso velocity. The quadruped_jump reward includes a term encouraging some minimal height of the center of mass.

G.2 ARCHITECTURES

We use the same architectures for all methods.

• The backward representation network 𝐵(𝑠) and the feature network 𝜙(𝑠) are represented by a feedforward neural network with three hidden layers, each with 256 units, that takes a state as input and outputs an 𝐿2-normalized embedding of radius √𝑑.

• For both the successor features 𝜓(𝑠, 𝑎, 𝑧) and the forward network 𝐹(𝑠, 𝑎, 𝑧), we first preprocess (𝑠, 𝑎) and (𝑠, 𝑧) separately by two feedforward networks with two hidden layers (each with 1024 units) into a 512-dimensional space. Then we concatenate the two outputs and pass them into another 2-layer feedforward network (each layer with 1024 units) to output a 𝑑-dimensional vector.

• For the policy network 𝜋(𝑠, 𝑧), we first preprocess 𝑠 and (𝑠, 𝑧) separately by two feedforward networks with two hidden layers (each with 1024 units) into a 512-dimensional space. Then we concatenate the two outputs and pass them into another 2-layer feedforward network (each layer with 1024 units) to output a 𝑑_𝐴-dimensional vector, to which we apply a Tanh activation since the action space is [−1, 1]^{𝑑_𝐴}.

For all the architectures, we apply layer normalization (Ba et al., 2016) and a Tanh activation in the first layer in order to standardize the states and actions. We use ReLU for the remaining layers. We also pre-normalize 𝑧, 𝑧 ← √𝑑 𝑧/‖𝑧‖_2, in the input of 𝐹, 𝜋 and 𝜓. Empirically, we observed that removing the preprocessing and passing a concatenation of (𝑠, 𝑎, 𝑧) directly to the network leads to unstable training. The same holds when we preprocess (𝑠, 𝑎) and 𝑧 instead of (𝑠, 𝑎) and (𝑠, 𝑧), which suggests that the preprocessing of 𝑧 should be influenced by the current state. For maze environments, we added an additional hidden layer after the preprocessing (for both the policy and the forward / successor features), as it helped to improve the results.
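The two-branch wiring of 𝐹(𝑠, 𝑎, 𝑧) can be sketched at the value level in NumPy (random weights, our own helper names; the actual implementation uses trainable PyTorch modules):

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_a, d, hidden = 24, 6, 50, 1024

def mlp_params(sizes):
    # Random Gaussian weights scaled by fan-in; stand-ins for trained parameters.
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    # First layer: layer normalization + Tanh (input standardization), then ReLU layers.
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i == 0:
            x = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
            x = np.tanh(x)
        elif i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# Two preprocessing branches into a 512-dimensional space, then a joint head.
pre_sa = mlp_params([d_s + d_a, hidden, hidden, 512])
pre_sz = mlp_params([d_s + d, hidden, hidden, 512])
head = mlp_params([512 + 512, hidden, hidden, d])

def F(s, a, z):
    z = np.sqrt(d) * z / np.linalg.norm(z, axis=-1, keepdims=True)  # pre-normalize z
    u = mlp(pre_sa, np.concatenate([s, a], -1))
    v = mlp(pre_sz, np.concatenate([s, z], -1))
    return mlp(head, np.concatenate([u, v], -1))

out = F(rng.standard_normal((8, d_s)), rng.standard_normal((8, d_a)),
        rng.standard_normal((8, d)))
assert out.shape == (8, d)
```

Note that 𝑧 enters through the (𝑠, 𝑧) branch, so its preprocessing is conditioned on the current state, as the text above indicates.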

G.3 SAMPLING OF 𝑧

We mix two methods for sampling 𝑧:

1. We sample 𝑧 uniformly on the sphere of radius √𝑑 in ℝ^𝑑 (so each component of 𝑧 has size ≈ 1).

2. We sample 𝑧 using the formula for 𝑧_𝑟 corresponding to the reward for reaching a random goal state 𝑠 in Theorems 1-2. Namely, we set 𝑧 = 𝐵(𝑠) for FB and 𝑧 = ( ∑_{𝑖=1}^𝑚 𝜙(𝑠_𝑖)𝜙(𝑠_𝑖)^⊤ )^+ 𝜙(𝑠) for SFs, where 𝑠 ∼ 𝜌 is a random state sampled from the replay buffer and the 𝑠_𝑖 are the states in a minibatch.

For the main series of results, we used a 50% mix ratio for these two methods. Different algorithms can benefit from different ratios: this is explored in the appendix.

G.4 LEARNING THE POLICIES 𝜋_𝑧: POLICY NETWORK

As the action space is continuous, we cannot compute the argmax over actions in closed form. Instead, we consider a latent-conditioned policy network 𝜋_𝜂: 𝑆 × 𝑍 → 𝐴, and we learn the policy parameters 𝜂 by performing stochastic gradient ascent on the objective E_{𝑠,𝑧}[𝐹(𝑠, 𝜋_𝜂(𝑠, 𝑧), 𝑧)^⊤𝑧] for FB, or E_{𝑠,𝑧}[𝜓(𝑠, 𝜋_𝜂(𝑠, 𝑧), 𝑧)^⊤𝑧] for SFs. We also incorporate techniques introduced in the TD3 paper (Fujimoto et al., 2018) to address function approximation error in actor-critic methods: double networks, and target policy smoothing by adding noise 𝜀 to the actions.

Let 𝜃_1 and 𝜃_2 be the parameters of the two forward networks, and let 𝜔 be the parameters of the backward network. Let 𝜃_1^−, 𝜃_2^− and 𝜔^− be the parameters of their corresponding target networks. Let {(𝑠_𝑖, 𝑎_𝑖, 𝑠_𝑖^next)}_{𝑖∈𝐼} ⊂ 𝒟 be a mini-batch of |𝐼| = 𝑏 transitions, and let {𝑧_𝑖}_{𝑖∈𝐼} be a mini-batch of 𝑏 latent variables sampled as in Section G.3.
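The first sampling method is a one-liner: normalize a Gaussian sample. A NumPy sketch (names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, batch = 50, 1024

# Sample z uniformly on the sphere of radius sqrt(d): a normalized Gaussian
# sample is uniform on the unit sphere, then rescale by sqrt(d).
g = rng.standard_normal((batch, d))
z = np.sqrt(d) * g / np.linalg.norm(g, axis=-1, keepdims=True)

# Each z has norm sqrt(d), so each component has typical size ~ 1.
assert np.allclose(np.linalg.norm(z, axis=-1), np.sqrt(d))
```

The √𝑑 radius matches the pre-normalization 𝑧 ← √𝑑 𝑧/‖𝑧‖_2 applied at the input of 𝐹, 𝜋 and 𝜓, so training-time and test-time 𝑧 live on the same sphere.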
The empirical version of the main FB loss in (8) is (with an additional 1/2 factor), for 𝑘 = 1, 2:

    ℒ(𝜃_𝑘, 𝜔) = 1/(2𝑏(𝑏−1)) ∑_{𝑖,𝑗∈𝐼, 𝑖≠𝑗} ( 𝐹_{𝜃_𝑘}(𝑠_𝑖, 𝑎_𝑖, 𝑧_𝑖)^⊤𝐵_𝜔(𝑠_𝑗^next) − 𝛾 min_{𝑙=1,2} 𝐹_{𝜃_𝑙^−}(𝑠_𝑖^next, 𝜋_𝜂(𝑠_𝑖^next, 𝑧_𝑖) + 𝜀_𝑖, 𝑧_𝑖)^⊤𝐵_{𝜔^−}(𝑠_𝑗^next) )²
               − (1/𝑏) ∑_{𝑖∈𝐼} 𝐹_{𝜃_𝑘}(𝑠_𝑖, 𝑎_𝑖, 𝑧_𝑖)^⊤𝐵_𝜔(𝑠_𝑖^next)   (41)

where 𝜀_𝑖 is sampled from a truncated centered Gaussian with variance 𝜎² (for policy smoothing).

The empirical version of the auxiliary 𝐹 loss in (9) is, for 𝑘 = 1, 2:

    ℒ′(𝜃_𝑘) = (1/𝑏) ∑_{𝑖∈𝐼} ( 𝐹_{𝜃_𝑘}(𝑠_𝑖, 𝑎_𝑖, 𝑧_𝑖)^⊤𝑧_𝑖 − 𝐵_𝜔(𝑠_𝑖^next)^⊤ Cov^+ 𝑧_𝑖 − 𝛾 min_{𝑙=1,2} 𝐹_{𝜃_𝑙^−}(𝑠_𝑖^next, 𝜋_𝜂(𝑠_𝑖^next, 𝑧_𝑖) + 𝜀_𝑖, 𝑧_𝑖)^⊤𝑧_𝑖 )²   (42)

where Cov^+ is the pseudo-inverse of the empirical covariance matrix Cov = (1/𝑏) ∑_{𝑖∈𝐼} 𝐵_𝜔(𝑠_𝑖)𝐵_𝜔(𝑠_𝑖)^⊤. We use 1/𝑑 as the regularization coefficient in front of ℒ′(𝜃_𝑘).

For policy training, the empirical loss is:

    ℒ(𝜂) = −(1/𝑏) ∑_{𝑖∈𝐼} min_{𝑙=1,2} 𝐹_{𝜃_𝑙}(𝑠_𝑖, 𝜋_𝜂(𝑠_𝑖, 𝑧_𝑖) + 𝜀_𝑖, 𝑧_𝑖)^⊤𝑧_𝑖   (43)

The same techniques are also used for SFs.
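For concreteness, here is a value-level NumPy sketch of the main empirical FB loss on one mini-batch, with a single network standing in for the double/target networks, and precomputed arrays standing in for the network outputs (all names ours; the actual training code uses PyTorch with gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
b, d, gamma = 32, 8, 0.98

# Stand-ins for network outputs on one mini-batch (in practice these come from
# F_theta(s_i, a_i, z_i), the target networks, and B_omega(s_j^next)).
F_sa   = rng.standard_normal((b, d))  # F(s_i, a_i, z_i)
F_next = rng.standard_normal((b, d))  # F_target(s_i^next, pi(s_i^next, z_i) + eps_i, z_i)
B_next = rng.standard_normal((b, d))  # B(s_j^next)

# Off-diagonal TD term over all pairs (i, j), i != j:
# (F(s_i,a_i,z_i)^T B(s_j^next) - gamma * F_target^T B_target(s_j^next))^2
td = F_sa @ B_next.T - gamma * (F_next @ B_next.T)  # [b, b] pairwise matrix
off_diag = ~np.eye(b, dtype=bool)
loss = (td[off_diag] ** 2).sum() / (2 * b * (b - 1))
# Diagonal attraction term: -(1/b) sum_i F(s_i,a_i,z_i)^T B(s_i^next)
loss -= np.einsum('id,id->i', F_sa, B_next).mean()
```

The two terms mirror the derivation above: the squared pairwise term estimates the TD part of the Bellman residual, while the diagonal term comes from the −2𝐹(𝑠_𝑡, 𝑎_𝑡)^⊤𝐵(𝑠_{𝑡+1}) cross term after the 𝑃/𝜌 density cancels.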

G.5 HYPERPARAMETERS

Table 1 summarizes the hyperparameters used in our experiments. A hyperparameter sensitivity analysis for two domains (Walker and Cheetah) is included in Appendix J.

Hyperparameter tuning. Since all methods share the same core, we chose to use the same hyperparameters rather than tune per method, as per-method tuning could lead to less robust conclusions. We did not do hyperparameter sweeps for each baseline and task: first, this would have been too computationally intensive given the number of setups, and second, it would be too close to going back to a supervised method for each task. Instead, to avoid overfitting, we tuned architectures and hyperparameters by hand on the Walker environment only, with the RND replay buffer, and reused these parameters across all methods and tasks (except for Maze, on which all methods behave differently). We identified some trends by monitoring learning curves and downstream performance on Walker, and fixed a configuration that led to overall good performance for all the methods. We avoided a full sweep both to avoid overfitting to Walker and to focus on robustness. For instance, the learning rate 10^-4 seemed to work well with all methods, as seen in Appendix J.

Some trends were common to all methods: indeed, all the methods share a common core (training of the successor features 𝜓 or 𝐹, and training of the policies 𝜋) and differ only in the training of the basic features 𝜙 or 𝐵. For the basic features 𝜙, the various representation learning losses were easy to fit and had low hyperparameter sensitivity: we always observed a smooth decrease of the losses until convergence. The challenging part was learning 𝜓 and 𝐹 and their corresponding policies, which is common across methods.

H DETAILED EXPERIMENTAL RESULTS

We report the full experimental results, first as a table (Section H.1). In Section H.2 we plot aggregated results for easier interpretation: first, aggregated across all tasks with a plot of the variability for each method; second, aggregated over environments (since each environment corresponds to a trained zero-shot model); third, for each individual task but still aggregated over replay buffers. In Section H.3 we plot all individual results per task and replay buffer. In Section H.4 we plot aggregate results split by replay buffer.

H.1 FULL RESULTS

Table 2: Score of each method, split by task and replay buffer. Average over ten random seeds, with ±1𝜎 estimated standard deviation on this average estimator. For Maze, we report the average over the 20 goals defined in the environment. We highlight the three leading methods (four when confidence intervals overlap) for each task.

H.2 AGGREGATE PLOTS OF RESULTS

Here we plot the results: first averaged over everything, then by environment averaged over the tasks of that environment, and finally by task.

AEnc, Trans, FB, LRA-P and CL all recover a picture of the maze. For the other methods the picture is less clear (notably, Rand gets a near-circle, possibly as an instance of concentration-of-measure theorems). Whether t-SNE preserves the shape of the maze does not appear to be correlated with performance: AEnc and Trans learn nice features but perform poorly, while the t-SNE of Lap is not visually interpretable but its performance is good.

K GOAL-ORIENTED BASELINES

Goal-oriented methods such as universal value functions (UVFs) (Schaul et al., 2015) learn a 𝑄-function 𝑄(𝑠, 𝑎, 𝑔) indexed by a goal description 𝑔. When taking for 𝑔 the set of all possible target states 𝑠′ ∈ 𝑆, these methods can learn to reach arbitrary target states. For instance, Ma et al. (2020) use a mixture of universal SFs and UVFs to learn goal-conditioned agents and policies.

In principle, such goal-oriented methods are not designed to deal with dense rewards, which are linear combinations of goals 𝑔. Indeed, the optimal 𝑄-function for a linear combination of goals is not the linear combination of the optimal 𝑄-functions. Nevertheless, we may still try to see how such linear combinations perform. The linear combination may be applied at the level of the 𝑄-functions, or at the level of some goal descriptors, as follows.

Here, as an additional baseline for our experiments, we test a slight modification of the scheme from Ma et al. (2020), as suggested by one reviewer. We learn two embeddings (𝜓, 𝑤) using state-reaching tasks. The 𝑄-function for reaching state 𝑠′ (goal 𝑔 = 𝑠′) is modeled as 𝑄(𝑠_𝑡, 𝑎_𝑡, 𝑠′) = 𝜓(𝑠_𝑡, 𝑎_𝑡, 𝑤(𝑠′))^⊤𝑤(𝑠′), trained for the reward 1_{𝑠_{𝑡+1}=𝑠′}. A policy network 𝜋(𝑠, 𝑤(𝑠′)) outputs the action at 𝑠 for reaching goal 𝑠′. Thus, the training loss is

    ℒ(𝜓, 𝑤) := E_{(𝑠_𝑡,𝑎_𝑡,𝑠_{𝑡+1})∼𝜌, 𝑠′∼𝜌} [ ( 𝜓(𝑠_𝑡, 𝑎_𝑡, 𝑤(𝑠′))^⊤𝑤(𝑠′) − 1_{𝑠_{𝑡+1}=𝑠′} − 𝛾 𝜓̄(𝑠_{𝑡+1}, 𝜋(𝑠_{𝑡+1}, 𝑤̄(𝑠′)), 𝑤̄(𝑠′))^⊤𝑤̄(𝑠′) )² ]

Similarly to the other baselines, the policy network 𝜋(𝑠, 𝑤(𝑠′)) is trained by gradient ascent on the policy parameters to maximize

    E_{𝑠∼𝜌, 𝑠′∼𝜌}[ 𝑄(𝑠, 𝜋(𝑠, 𝑤(𝑠′)), 𝑠′) ] = E_{𝑠∼𝜌, 𝑠′∼𝜌}[ 𝜓(𝑠, 𝜋(𝑠, 𝑤(𝑠′)), 𝑤(𝑠′))^⊤𝑤(𝑠′) ].

At test time, given a reward 𝑟, we proceed as in FB and estimate 𝑧 = E_{𝑠∼𝜌}[𝑟(𝑠)𝑤(𝑠)], and use the policy 𝜋(𝑠, 𝑧). This amounts to extending from goal-reaching tasks to dense tasks by linearity on 𝑤, even though, in principle, this method should only optimize for single-goal rewards.
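The test-time inference step shared with FB is a one-line Monte-Carlo estimate; a NumPy sketch with hypothetical reward-labelled samples (names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d = 10_000, 50

# Hypothetical reward-labelled states: embeddings w(s) (or B(s) for FB) and
# rewards r(s) for states s drawn from the dataset distribution rho.
w_s = rng.standard_normal((n_samples, d))
r_s = rng.standard_normal(n_samples)

# z = E_{s~rho}[r(s) w(s)], estimated by averaging over the reward samples;
# the policy pi(s, z) is then used directly, with no further training.
z = (r_s[:, None] * w_s).mean(axis=0)
assert z.shape == (d,)
```

This matches the experimental protocol of the paper, where tasks are revealed by 10,000 reward samples at test time.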
We use the same architectures for (𝜓, 𝑤) as for (𝐹, 𝐵), with double networks 𝜓₁, 𝜓₂ and policy smoothing as in (43). The sparse reward 1_{𝑠_{𝑡+1}=𝑠′} could suffer from large variance in continuous state spaces. We mitigate this in one of two ways:

• biasing the sampling of 𝑠′ by setting 𝑠′ to 𝑠_{𝑡+1} half of the time;
• replacing the reward by a less sparse one, 1_{‖𝑠_{𝑡+1}−𝑠′‖₂ ≤ 𝜀}.

Table 4 reports normalized scores for each domain, for the two variants just described, trained on the RND replay buffer, averaged over tasks and over 10 random seeds. The first variant scores 84% on Maze, 62% on Quadruped tasks, 17% on Walker tasks, and 2% on Cheetah tasks. The second variant is worse overall. So, overall, this works well on Maze (as expected, since Maze is a goal-oriented problem), moderately well on Quadruped (where most other methods work well), and poorly in the other environments. This is expected, as this method is not designed to handle dense combinations of goals.

Relationship with FB. The use of universal SFs in Ma et al. (2020) is quite different from the use of universal SFs in Barreto et al. (2017) and Borsa et al. (2018). The latter is mathematically related to FB, as described at the end of Section 4. The former uses SFs as an intermediate tool for a goal-oriented model, and is more distantly related to FB (notably, it is designed to deal with single goals, not linear combinations of goals such as dense rewards).

A key difference between FB and goal-oriented methods is the following. Above, we use 𝜓(𝑠, 𝑎, 𝑤(𝑔))⊤𝑤(𝑔) as a model of the optimal 𝑄-function for the policy with goal 𝑔. This is reminiscent of 𝐹(𝑠, 𝑎, 𝑧)⊤𝐵(𝑔) with 𝑧 = 𝐵(𝑔), the value of 𝑧 used to reach 𝑔 in FB. However, even if we restrict FB to goal-reaching by using 𝑧 = 𝐵(𝑔) only (a significant restriction), FB learns 𝐹(𝑠, 𝑎, 𝐵(𝑔))⊤𝐵(𝑔′) to model the successor measure, i.e., the number of visits to each goal 𝑔′ for the policy with goal 𝑔, starting at (𝑠, 𝑎).
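The two variance-reduction tricks above can be sketched as follows (a minimal NumPy illustration; the batch contents are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d, eps = 6, 3, 0.1

next_states = rng.normal(size=(batch, d))   # s_{t+1} from the replay buffer
goals = rng.normal(size=(batch, d))         # goals s' sampled from rho

# Variant 1: bias goal sampling by setting s' = s_{t+1} half of the time.
mask = rng.random(batch) < 0.5
goals[mask] = next_states[mask]

# Variant 2: replace the sparse reward 1{s_{t+1} = s'} by an epsilon-ball version.
reward = (np.linalg.norm(next_states - goals, axis=-1) <= eps).astype(float)

# Goals that were set to s_{t+1} by variant 1 automatically receive reward 1.
assert (reward[mask] == 1.0).all()
```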
Thus, FB learns an object indexed by (𝑠, 𝑎, 𝑔, 𝑔′) for all pairs (𝑔, 𝑔′): it captures more information (it models successor measures instead of 𝑄-functions), and allows for recovering linear combinations of goals in a principled way. Meanwhile, even assuming perfect neural network optimization in goal-reaching methods, there is no reason the goal-oriented policies would be optimal for arbitrary linear combinations of goals, only for single goals.
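In symbols (a side-by-side sketch using the notation above; the successor-measure factorization is the FB model with density taken relative to 𝜌):

```latex
\underbrace{Q(s, a, g) \,\approx\, \psi(s, a, w(g))^\top w(g)}_{\text{goal-oriented: one scalar per } (s, a, g)}
\qquad \text{vs.} \qquad
\underbrace{M^{\pi_{B(g)}}(s, a, \mathrm{d}g') \,\approx\, F(s, a, B(g))^\top B(g')\, \rho(\mathrm{d}g')}_{\text{FB: a measure over } g' \text{ for each } (s, a, g)}
```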



This is the dual norm on measures of the 𝐿²(𝜌) norm on functions. It amounts to learning a model of 𝑀(𝑠, 𝑎, d𝑠′) by learning the density of visits to 𝑠′, knowing we start at (𝑠, 𝑎), relative to the average density 𝜌(d𝑠′) in the dataset.
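Spelled out (assuming the measure has a density with respect to 𝜌; 𝑚 below denotes a test function):

```latex
\|m\|_{L^2(\rho)}^2 = \mathbb{E}_{s' \sim \rho}\big[m(s')^2\big],
\qquad
\|M(s, a, \cdot)\|_{\rho}^2 = \mathbb{E}_{s' \sim \rho}\!\left[\left(\frac{M(s, a, \mathrm{d}s')}{\rho(\mathrm{d}s')}\right)^{\!2}\right],
```

so minimizing this norm of the model error amounts to fitting the relative density of visits to 𝑠′ with respect to 𝜌.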



Figure 2: Zero-shot scores for each task, with supervised online and offline TD3 as toplines. Average over 3 replay buffers and 10 random seeds.

Figure 3: Maze, Walker, Cheetah and Quadruped environments used in our experiments. In the Maze domain (left), we show an example of an initial state (yellow point) and the 20 test goals (red circles).

Figure 4: Zero-shot scores of ten SF methods and FB, aggregated over tasks using normalized scores as described in the text. To assess variability, the box plot on the right shows the variations of the distribution of normalized scores over random seeds, environments, and replay buffers.

Figure 12: Visualization of the embedding vectors obtained by each method (𝜙 for SF and 𝐵 for FB) on the maze domain, after projecting them to two dimensions with t-SNE.

Helper names below (such as agent.sample_action, agent.mu_net, agent.chi_net) are illustrative.

Pytorch code for the FB training loss:

def compute_fb_loss(agent, obs, action, next_obs, z, discount):
    with torch.no_grad():
        # target policy action with smoothing, as in TD3
        next_action = agent.sample_action(next_obs, z)
        target_F1, target_F2 = agent.forward_target_net(next_obs, z, next_action)  # batch x z_dim
        target_B = agent.backward_target_net(next_obs)  # batch x z_dim
        target_M1, target_M2 = [torch.einsum('sd, td -> st', target_Fi, target_B)
                                for target_Fi in [target_F1, target_F2]]  # batch x batch
        target_M = torch.min(target_M1, target_M2)

    F1, F2 = agent.forward_net(obs, z, action)
    B = agent.backward_net(next_obs)
    M1, M2 = [torch.einsum('sd, td -> st', Fi, B) for Fi in [F1, F2]]  # batch x batch
    I = torch.eye(*M1.size(), device=M1.device)
    off_diag = ~I.bool()
    fb_offdiag = 0.5 * sum((M - discount * target_M)[off_diag].pow(2).mean() for M in [M1, M2])
    fb_diag = -sum(M.diag().mean() for M in [M1, M2])
    fb_loss = fb_offdiag + fb_diag

    # auxiliary Q-loss, using the implicit reward B(s) Cov(B)^{-1} z
    with torch.no_grad():
        next_Q1, next_Q2 = [torch.einsum('sd, sd -> s', next_Fi, z)
                            for next_Fi in [target_F1, target_F2]]
        next_Q = torch.min(next_Q1, next_Q2)
        cov = torch.matmul(B.T, B) / B.shape[0]
        inv_cov = torch.linalg.pinv(cov)
        implicit_reward = (torch.matmul(B, inv_cov) * z).sum(dim=1)  # batch_size
        target_Q = implicit_reward.detach() + discount * next_Q  # batch_size
    Q1, Q2 = [torch.einsum('sd, sd -> s', Fi, z) for Fi in [F1, F2]]
    q_loss = F.mse_loss(Q1, target_Q) + F.mse_loss(Q2, target_Q)
    return fb_loss + q_loss

Pytorch code for the SF training loss:

def compute_sf_loss(agent, obs, action, next_obs, next_goal, z, discount):
    with torch.no_grad():
        next_action = agent.sample_action(next_obs, z)
        next_F1, next_F2 = agent.successor_target_net(next_obs, z, next_action)  # batch x z_dim
        target_phi = agent.feature_net(next_goal).detach()  # batch x z_dim
        next_Q1, next_Q2 = [torch.einsum('sd, sd -> s', next_Fi, z)
                            for next_Fi in [next_F1, next_F2]]
        next_Q = torch.min(next_Q1, next_Q2)
        target_Q = torch.einsum('sd, sd -> s', target_phi, z) + discount * next_Q

    F1, F2 = agent.successor_net(obs, z, action)
    Q1, Q2 = [torch.einsum('sd, sd -> s', Fi, z) for Fi in [F1, F2]]
    sf_loss = F.mse_loss(Q1, target_Q) + F.mse_loss(Q2, target_Q)
    return sf_loss

Pytorch code for the Laplacian eigenfunctions (Lap) loss:

def compute_phi_loss(agent, obs, next_obs):
    phi = agent.feature_net(obs)
    next_phi = agent.feature_net(next_obs)
    loss = (phi - next_phi).pow(2).sum(dim=1).mean()
    # orthonormality regularization on phi
    Cov = torch.matmul(phi, phi.T)
    I = torch.eye(*Cov.size(), device=Cov.device)
    off_diag = ~I.bool()
    orth_loss_diag = -2 * Cov.diag().mean()
    orth_loss_offdiag = Cov[off_diag].pow(2).mean()
    return loss + agent.cfg.orth_coef * (orth_loss_diag + orth_loss_offdiag)

Pytorch code for the contrastive CL loss:

def compute_phi_loss(agent, obs, future_obs):
    phi = agent.feature_net(obs)
    future_phi = agent.feature_net(future_obs)
    mu = agent.mu_net(obs)  # second embedding (network name assumed)
    logits = torch.einsum('sd, td -> st', mu, future_phi)  # batch x batch
    I = torch.eye(*logits.size(), device=logits.device)
    off_diag = ~I.bool()
    # positive pairs on the diagonal, negative pairs off the diagonal
    return -logits.diag().mean() + logits[off_diag].pow(2).mean()

Pytorch code for the inverse curiosity (ICM) loss:

def compute_phi_loss(agent, obs, action, next_obs):
    phi = agent.feature_net(obs)
    next_phi = agent.feature_net(next_obs)
    predicted_action = agent.inverse_dynamic_net(torch.cat([phi, next_phi], dim=-1))
    return F.mse_loss(predicted_action, action)

Listing 9: Pytorch code for the LRA-P loss:

def compute_phi_loss(agent, obs, action, next_obs):
    # low-rank model of the transition matrix: P(ds' | s, a) ~ chi(s, a)^T phi(s') rho(ds')
    chi = agent.chi_net(obs, action)  # batch x z_dim (network name assumed)
    phi = agent.feature_net(next_obs)  # batch x z_dim
    P = torch.matmul(chi, phi.T)  # batch x batch
    I = torch.eye(*P.size(), device=P.device)
    off_diag = ~I.bool()
    loss = -2 * P.diag().mean() + P[off_diag].pow(2).mean()
    # compute orthonormality loss
    Cov = torch.matmul(phi, phi.T)
    orth_loss_diag = -2 * Cov.diag().mean()
    orth_loss_offdiag = Cov[off_diag].pow(2).mean()
    return loss + agent.cfg.orth_coef * (orth_loss_diag + orth_loss_offdiag)

Pytorch code for the LRA-SR loss:

def compute_phi_loss(agent, obs, action, next_obs, discount):
    # low-rank model of the successor representation of the dataset policy
    chi = agent.chi_net(obs, action)  # batch x z_dim (network name assumed)
    phi = agent.feature_net(next_obs)  # batch x z_dim
    SR = torch.matmul(chi, phi.T)  # batch x batch
    with torch.no_grad():
        next_chi = agent.chi_target_net(next_obs, agent.sample_action(next_obs))
        target_phi = agent.feature_target_net(next_obs)
        target_SR = torch.matmul(next_chi, target_phi.T)
    I = torch.eye(*SR.size(), device=SR.device)
    off_diag = ~I.bool()
    loss = -2 * SR.diag().mean() + (SR - discount * target_SR.detach())[off_diag].pow(2).mean()
    # compute orthonormality loss
    Cov = torch.matmul(phi, phi.T)
    orth_loss_diag = -2 * Cov.diag().mean()
    orth_loss_offdiag = Cov[off_diag].pow(2).mean()
    return loss + agent.cfg.orth_coef * (orth_loss_diag + orth_loss_offdiag)

Hyperparameters used in our experiments.

TABLE OF RESULTS

Table 4: Normalized score for each domain of two variants of USF from Ma et al. (2020), trained on the RND replay buffer, averaged over tasks and 10 random seeds.

ACKNOWLEDGEMENTS

The authors would like to thank Olivier Delalleau, Armand Joulin, Alessandro Lazaric, Sergey Levine, Matteo Pirotta, Andrea Tirinzoni, and the anonymous reviewers for helpful comments and questions on the research and manuscript.

LIST OF APPENDICES

Appendix A sketches a proof of the basic properties of SFs and the FB representation, Theorems 1 and 2, respectively from Borsa et al. (2018) and Touati & Ollivier (2021).

Appendix B derives the loss (8) we use to learn 𝐹 and 𝐵, and the auxiliary orthonormalization loss.

Appendix C describes the precise mathematical relationship between the contrastive loss (17) and successor measures of the exploration policy.

Appendix D further discusses why low-rank models make more sense on successor measures than on the transition matrix 𝑃.

Appendix E proves that if 𝑃 is low-rank, given by 𝜒⊤𝜇, then successor features using 𝜙 = 𝜇 (but not 𝜒) provide optimal policies.

Appendix F proves that the loss (16) used to learn low-rank 𝑃 is an asymmetric version of the Laplacian eigenfunction loss (14).

Appendix G describes the detailed experimental setup: environments, architectures, policy training, methods for sampling 𝑧, and hyperparameters. The full code can be found at https://github.com/facebookresearch/controllable_agent

Appendix H contains the full table of experimental results, as well as plots aggregated over several variables (per task, per replay buffer, etc.).

Appendix I analyzes the learned features. We use feature rank analysis to study the degree of feature collapse for some methods. We also provide t-SNE visualizations of the learned embeddings for all methods (for the Maze environment).

Appendix J provides hyperparameter sensitivity plots (latent dimension 𝑑, learning rate, batch size, mixing ratio for 𝑧 sampling).

Appendix K describes a further baseline where we take a goal-oriented method inspired from Ma et al. (2020) and extend it to dense rewards by linearity. Since results were very poor except for Maze (which is goal-oriented), we did not discuss it in the main text.

Appendix L provides PyTorch snippets for the key losses, notably the FB loss and the SF loss, as well as the various feature-learning methods for SF.

H.4 INFLUENCE OF THE REPLAY BUFFER

Here we plot the influence of the replay buffer, by reporting results separated by replay buffer but averaged over the tasks of each environment (Fig. 10). Overall, there is a clear failure case of the Proto buffer on the Quadruped environment: the supervised TD3 baseline performs poorly for all tasks in that environment. Otherwise, results are broadly consistent across replay buffers: with a few exceptions, the same methods succeed or fail on the same environments.

Published as a conference paper at ICLR 2023.

In Fig. 11, we plot results aggregated over all tasks but split by replay buffer. Box plots further show variability within random seeds and environments for a given replay buffer. Overall method rankings are broadly consistent between RND and APS, except for CL. Note that the aggregated normalized score for Proto is largely influenced by the failure on Quadruped: normalization by a very low baseline (Fig. 10, Quadruped plot for Proto) somewhat artificially pushes Trans high up (this shows the limit of using scores normalized by the offline TD3 score), while the other methods' rankings are more similar to RND and APS. FB and Lap work very well on all replay buffers, with LRA-SR and Trans a bit behind due to their failures on some tasks.
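The normalization artifact just discussed can be illustrated with a toy computation (hypothetical numbers; each raw score is divided by the offline TD3 topline for the same task and buffer):

```python
import numpy as np

# Hypothetical raw task scores for one method, and the offline TD3 topline scores.
raw_scores = np.array([120.0, 300.0, 15.0])
td3_scores = np.array([400.0, 350.0, 5.0])  # the last topline is very low,
                                            # as for Proto on Quadruped

normalized = raw_scores / td3_scores
# The last task dominates the aggregate despite a poor raw score.
assert normalized[2] > normalized[:2].max()
```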

I ANALYSIS OF THE LEARNED FEATURES

I.1 FEATURE RANK

Here we test the hypothesis that feature collapse is responsible for some cases of bad performance. This is especially relevant for Maze, where some methods may have little incentive to learn features beyond the original two features (𝑥, 𝑦).

We report in Table 3 the effective rank of the learned features 𝐵 or 𝜙, computed as in Lyle et al. (2022): the fraction of eigenvalues of E_{𝑠∼𝜌} 𝐵(𝑠)𝐵(𝑠)⊤ or E_{𝑠∼𝜌} 𝜙(𝑠)𝜙(𝑠)⊤ above a certain threshold. Features are trained on the RND replay buffer and results are averaged over 10 random seeds. We use 𝑛 = 100,000 samples to estimate the covariance, and 𝜀 = 10⁻⁴.

For most methods and all environments except Maze, more than 90% (often 100%) of the eigenvalues are above 10⁻⁴, so the effective rank is close to full. The Maze environment is a clear exception: on Maze, for the Rand, AEnc, Trans, Latent and ICM methods, only about one third of the eigenvalues are above 10⁻⁴. The other methods keep 100% of the eigenvalues above 10⁻⁴. This is perfectly aligned with the performance of each method on Maze.

So rank reduction does happen for some methods. This reflects the fact that the two features (𝑥, 𝑦) already convey the necessary information about states and dynamics, but are not sufficient to solve the problem via successor features. In the Maze environment, auto-encoders or transition models can perfectly optimize their loss just by keeping the original two features (𝑥, 𝑦), and they have no incentive to learn other features, so the effective rank could have been as low as 2. These methods may benefit from auxiliary losses to prevent eigenvalue collapse, similar to the orthonormalization loss used for 𝐵. We did not include such losses, because we wanted to keep the same methods (auto-encoders, ICM, transition models...) used in the literature. But even with an auxiliary loss, a full rank could be achieved just by keeping (𝑥, 𝑦) and then blowing up some additional irrelevant features.
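The effective-rank computation can be sketched as follows (a minimal NumPy illustration on synthetic features; the two feature matrices are hypothetical):

```python
import numpy as np

def effective_rank(features, eps=1e-4):
    """Fraction of eigenvalues of the empirical covariance E[phi(s) phi(s)^T]
    above the threshold eps (cf. Lyle et al., 2022)."""
    cov = features.T @ features / features.shape[0]  # (d, d) covariance estimate
    eigvals = np.linalg.eigvalsh(cov)
    return (eigvals > eps).mean()

rng = np.random.default_rng(0)
full = rng.normal(size=(100_000, 10))  # full-rank features
collapsed = full.copy()
collapsed[:, 2:] = 0.0                 # features collapsed to 2 dimensions, as on Maze

assert effective_rank(full) == 1.0
assert effective_rank(collapsed) == 0.2
```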

L PSEUDOCODE OF TRAINING LOSSES

Here we provide PyTorch snippets for the key losses, notably the FB loss and the SF loss, as well as the various feature-learning methods for SF.

