PROTO-VALUE NETWORKS: SCALING REPRESENTATION LEARNING WITH AUXILIARY TASKS

Abstract

Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well-understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent's network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)'s proto-value functions to deep reinforcement learning; accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment's reward function.

1. INTRODUCTION

In deep reinforcement learning (RL), an agent maps observations to a policy or return prediction by means of a neural network. The role of this network is to transform observations into a series of successively refined features, which are linearly combined by the final layer into the desired prediction. A common perspective treats this transformation and the intermediate features it produces as the agent's representation of its current state. Under this lens, the learning agent performs two tasks simultaneously: representation learning, the discovery of useful state features; and credit assignment, the mapping from these features to accurate predictions. Although end-to-end RL has been shown to obtain good performance in a wide variety of problems (Mnih et al., 2015; Levine et al., 2016; Bellemare et al., 2020), modern RL methods typically incorporate additional machinery that incentivizes the learning of good state representations: for example, predicting immediate rewards (Jaderberg et al., 2017), future states (Schwarzer et al., 2021a), or observations (Gelada et al., 2019); encoding a similarity metric (Castro, 2020; Agarwal et al., 2021a; Zhang et al., 2021); and data augmentation (Laskin et al., 2020). In fact, it is often possible, and desirable, to first learn a sufficiently rich representation with which credit assignment can then be efficiently performed; in that sense, representation learning has been a core aspect of RL from its early days (Sutton & Whitehead, 1993; Sutton, 1996; Ratitch & Precup, 2004; Mahadevan & Maggioni, 2007; Diuk et al., 2008; Konidaris et al., 2011; Sutton et al., 2011). An effective method for learning state representations is to have the network predict a collection of auxiliary tasks associated with each state (Caruana, 1997; Jaderberg et al., 2017; Chung et al., 2019).
In an idealized setting, auxiliary tasks can be shown to induce a set of features that correspond to the principal components of what is called the auxiliary task matrix (Bellemare et al., 2019; Lyle et al., 2021; Le Lan et al., 2022a). This makes it possible to analyze the theoretical approximation error (Petrik, 2007; Parr et al., 2008), generalization (Le Lan et al., 2022b), and stability (Ghosh & Bellemare, 2020) of the learned representation. Perhaps surprisingly, comparatively little is known about the empirical behaviour of auxiliary tasks on larger-scale environments. In particular, the scaling properties of representation learning from auxiliary tasks (i.e., the effect of using more tasks, or increasing network capacity) remain poorly understood. This paper aims to fill this knowledge gap. Our approach is to construct a family of auxiliary rewards that can be sampled and subsequently predicted. Specifically, we implement the successor measure (Blier et al., 2021; Touati & Ollivier, 2021), which extends the successor representation (Dayan, 1993) by replacing state-equality with set-inclusion. In our case, these sets are defined implicitly by a family of binary functions over states. We conduct most of our studies on binary functions derived from randomly-initialized networks, whose effectiveness as random cumulants has already been demonstrated (Dabney et al., 2021). Although our results may hold for other types of auxiliary rewards, our method has a number of benefits: it can be trivially scaled by sampling more random networks to serve as auxiliary tasks, it directly relates to the binary reward functions common in deep RL benchmarks, and it can to some extent be theoretically understood. The actual auxiliary tasks consist in predicting the expected return of the random policy for their corresponding auxiliary rewards; in the tabular setting, this corresponds to proto-value functions (Mahadevan & Maggioni, 2007; Stachenfeld et al., 2014; Machado et al., 2018).
Consequently, we call our method proto-value networks (PVN). We study the effectiveness of this method on the Arcade Learning Environment (ALE) (Bellemare et al., 2013). Overall, we find that PVN produces state features that are rich enough to support linear value approximations that are comparable to those of DQN (Mnih et al., 2015) on a number of games, while only requiring a fraction of the interactions with the environment reward function. We explore the features learned by PVN and show that they capture the temporal structure of the environment, which we hypothesize contributes to their utility when used with linear function approximation. In an ablation study, we find that increasing the value network's capacity improves the performance of our linear agents substantially, and that larger networks can accommodate more tasks. Perhaps surprisingly, we also find that our method performs best with what might seem like a small number of auxiliary tasks: the smallest networks we study produce their best representations from 10 or fewer tasks, and the largest, from 50 to 100 tasks. In a sense, this finding corroborates the result of Lyle et al. (2021, Fig. 5), where optimal performance (on a small set of Atari 2600 games and with the standard DQN network) was obtained with a single auxiliary task. From this finding we hypothesize that individual tasks may produce much richer representations than expected, and that the effect of any particular task on fixed-size networks (rather than the idealized, infinite-capacity setting studied in the literature) remains incompletely understood.

2. RELATED WORK

Deep RL algorithms have employed auxiliary prediction tasks to learn representations with various emergent properties (Schaul et al., 2015; Jaderberg et al., 2017; Machado et al., 2018; Bellemare et al., 2019; Gelada et al., 2019; Fedus et al., 2019; Dabney et al., 2021; Lyle et al., 2022). While most of these papers optimize auxiliary tasks in support of reward maximization from online interactions, our work investigates learning representations solely from auxiliary tasks on offline datasets. Closely related to our work is the study of random cumulants (Dabney et al., 2021; Lyle et al., 2021), both of which identify random cumulant auxiliary tasks as being especially useful in sparse-reward environments. Our work differs from these prior works in both motivation and implementation. Notably absent in prior work on random cumulants is the study of representational capacity as a function of the number of tasks. Another body of related work on decoupling representation learning from RL primarily revolves around the use of contrastive learning (Anand et al., 2019; Wu et al., 2019; Stooke et al., 2021; Schwarzer et al., 2021b; Erraqabi et al., 2022). Anand et al. (2019) proposed ST-DIM, a collection of temporal contrastive losses operating on image patches from environmental observations. Although the representations learned by ST-DIM are able to predict annotated state-variables in Atari 2600 games, their pretraining method was never evaluated for control. Stooke et al. (2021) use contrastive learning to learn the temporal dynamics, resulting in minor improvements in online control from a fixed representation. Additionally, Schwarzer et al. (2021b) augment next-state prediction with goal-conditioned RL and inverse dynamics modelling, enabling strong performance on the Atari 100k benchmark (Kaiser et al., 2020).
Our work is complementary to these prior works and investigates the utility of scaling auxiliary tasks for learning good representations, which in principle can be easily combined with existing approaches. Additionally, recent work on using state-similarity metrics tackles the representation learning problem through the lens of behavioral similarity (Castro et al., 2021; Zhang et al., 2021; Agarwal et al., 2021a). We note that, in contrast to our method, the behavioral metrics used in these works are heavily based on the reward structure of the environment. Related to our method, Touati & Ollivier (2021) consider representation learning with the successor measure (see also Touati, 2021, Algorithm 7). Algorithmically, their approach differs from ours in a number of ways, including the use of a learned state density function in lieu of indicator functions, the decomposition of the successor measure into its so-called forward and backward representations, and a bespoke sampling procedure to generate sample trajectories from which the representation is learned. By comparison, our approach directly constructs a relevant set of auxiliary tasks, which results in a significantly simpler algorithm that is more easily scaled according to available computational resources and to the full gamut of Atari 2600 games, as we will demonstrate. Recently, there have also been efforts to cast the representation learning problem in RL as a min-max objective where you learn state features that can linearly represent a specific class of value-functions (Bellemare et al., 2019) or the Bellman backup itself (Modi et al., 2021; Zhang et al., 2022). Although we do not frame our method in terms of a min-max formulation, we do seek to learn a representation that can linearly predict the value function of the random policy for any given reward function. These previous works are primarily theoretical in nature and often require specific assumptions about the underlying MDP.
In contrast, our class of auxiliary prediction tasks allows us to learn representations in environments with large, high-dimensional state-spaces, without any prior assumptions.

3. BACKGROUND

The RL problem can be modeled as a Markov Decision Process (MDP) defined by the 5-tuple $M = \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$, in which $\mathcal{X}$ is a set of states, $\mathcal{A}$ is a set of actions, $R : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is a scalar reward function, $P : \mathcal{X} \times \mathcal{A} \to \mathscr{P}(\mathcal{X})$ is a transition function that maps state-action pairs to a distribution over next states, and $\gamma \in [0, 1)$ is a discount factor. A policy $\pi : \mathcal{X} \to \mathscr{P}(\mathcal{A})$ is a function that maps states to a distribution over actions. The goal of an RL agent is to learn a policy that maximizes the cumulative discounted rewards from the environment, also known as the discounted return. The state-action value function is defined as the expected discounted return when starting in a state, taking an action, and following the policy $\pi$ thereafter: $Q^\pi(x, a) := \mathbb{E}_{A_t \sim \pi, P}\left[\sum_{t=0}^{\infty} \gamma^t R(X_t, A_t) \mid X_0 = x, A_0 = a\right]$. In this paper, we consider approximating the value function $Q^\pi$ using a linear combination of features. We call the map $\phi : \mathcal{X} \to \mathbb{R}^k$ a $k$-dimensional state representation; $\phi(x)$ is the feature vector for a state $x \in \mathcal{X}$. The value function approximant at $(x, a)$ is $Q(x, a) = \phi(x)^\top w_a$, where $w_a \in \mathbb{R}^k$ is a weight vector associated with action $a$. In deep RL, the state representation is parameterized by a neural network. Often, the representation is learned end-to-end by optimizing the parameters to make more accurate predictions about the value function. Additional predictions that further shape the state representation are called auxiliary tasks (Jaderberg et al., 2017). In this work, we write $\mathcal{T}$ for the set of auxiliary tasks. The successor representation (SR; Dayan, 1993) encodes the temporal structure of the MDP in terms of which states can be reached from any other state under a given policy. For a target state $\tilde{x} \in \mathcal{X}$, it is given by $\psi^\pi_{SR}(x, a, \tilde{x}) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{P}\{X_t = \tilde{x} \mid X_0 = x, A_0 = a, A_{t>0} \sim \pi\}$.
A convenient, recursive form expresses the SR in terms of an indicator function, highlighting that for each target state $\tilde{x}$, the SR is the value function associated with the reward function $R(x, a) = \mathbb{1}\{x = \tilde{x}\}$: $\psi^\pi_{SR}(x, a, \tilde{x}) = \mathbb{1}\{x = \tilde{x}\} + \gamma \, \mathbb{E}_\pi\left[\psi^\pi_{SR}(X', A', \tilde{x}) \mid X = x, A = a\right]$.
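To make the recursion concrete, the following is a minimal NumPy sketch of the state-action SR in a small tabular MDP. The function name `successor_representation`, the array layout, and the randomly generated toy MDP are assumptions made for illustration and are not taken from the paper. The sketch uses the closed form $(I - \gamma P^\pi)^{-1}$ for the state-to-state SR and then checks that the result satisfies the recursive form above.

```python
import numpy as np

def successor_representation(P, pi, gamma):
    """State-action SR psi[x, a, x_tilde] for a tabular MDP.

    P:  (n, A, n) array, P[x, a, y] = probability of reaching y after taking a in x.
    pi: (n, A) array, pi[x, a] = probability of choosing action a in state x.
    """
    n = P.shape[0]
    # State-to-state transitions under pi: P_pi[x, y] = sum_a pi[x, a] * P[x, a, y].
    P_pi = np.einsum("xa,xay->xy", pi, P)
    # State-only SR: sum_t gamma^t P_pi^t = (I - gamma * P_pi)^{-1}.
    psi_state = np.linalg.inv(np.eye(n) - gamma * P_pi)
    # The first action is given; every later action is drawn from pi.
    return np.eye(n)[:, None, :] + gamma * np.einsum("xay,yz->xaz", P, psi_state)

# Hypothetical toy MDP: 5 states, 3 actions, uniformly random policy.
rng = np.random.default_rng(0)
n, A, gamma = 5, 3, 0.9
P = rng.random((n, A, n))
P /= P.sum(axis=-1, keepdims=True)
pi = np.full((n, A), 1.0 / A)
psi = successor_representation(P, pi, gamma)

# Right-hand side of the recursion: 1{x = x_tilde} + gamma * E[psi(X', A', x_tilde)].
bellman_rhs = np.eye(n)[:, None, :] + gamma * np.einsum("xay,yb,ybz->xaz", P, pi, psi)
print(np.allclose(psi, bellman_rhs))
```

Summing `psi` over the last axis gives $1 / (1 - \gamma)$ for every state-action pair, since the indicator rewards over all target states add up to one at every time step.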

4. PROTO-VALUE NETWORKS

In this section, we derive our proto-value networks algorithm. At a high level, this algorithm learns a state representation that approximates the singular vectors associated with the successor measure, the extension of the SR to continuous state spaces. We do this in order to derive an algorithm that is more suitably tailored to the large state spaces of deep RL domains, where many states are encountered once or never at all, and some notion of distance between states must be accounted for. To gain some understanding into this process, let us consider how the method of auxiliary tasks (Jaderberg et al., 2017) can be used to obtain a state representation that approximates the SR. In the tabular setting, where $\mathcal{X}$ and $\mathcal{T}$ are of finite sizes $n$ and $m$ respectively, we write the feature matrix $\Phi \in \mathbb{R}^{n \times d}$, so that each state $x$ is associated with a feature vector $\phi(x) \in \mathbb{R}^d$. Given an auxiliary task matrix $\Psi \in \mathbb{R}^{n \times m}$, the method of auxiliary tasks can be shown to be equivalent to minimizing the loss function $L(\Phi, W) = \|\Phi W - \Psi\|_F^2 = \sum_{x \in \mathcal{X}, i \in \mathcal{T}} \left(\phi(x)^\top w_i - \psi_i(x)\right)^2$ jointly with respect to $\Phi$ and $W$. Here, $W \in \mathbb{R}^{d \times m}$ is a weight matrix with columns $(w_i)_{i=1}^m$ and $\psi_i(x)$ is the entry of $\Psi$ corresponding to state $x$ and task $i$. In the sequel, we will assume that a near-optimal $W$ can be obtained easily and simply consider the loss $L(\Phi) = \min_W L(\Phi, W)$, to be minimized over $\Phi$. It is known (e.g., Bellemare et al., 2019) that any feature matrix that minimizes this loss function must have columns that lie in the subspace spanned by the top $d$ left singular vectors of $\Psi$. In particular, when $\Psi$ is square and symmetric the auxiliary task method recovers the subspace spanned by its top $d$ eigenvectors. Here, we are interested in the setting in which $\Psi^{\pi_r}$ is the SR matrix for the uniformly random policy $\pi_r$. In the symmetric case, the eigenvectors of $\Psi^{\pi_r}$ form what are called the proto-value functions of the MDP (Mahadevan & Maggioni, 2007).
These eigenvectors are of special importance because they encode the spatial structure of the MDP in terms of a diffusion process, and have been shown to correlate with neural encodings of spatial location in mammals (Stachenfeld et al., 2014) .
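The link between the auxiliary task loss and the singular vectors of $\Psi$ can be checked numerically. The sketch below is illustrative only: the helper `task_loss` and the random task matrix are assumptions, not part of the paper. It evaluates $L(\Phi) = \min_W \|\Phi W - \Psi\|_F^2$ by least squares and compares features built from the top $d$ left singular vectors against random features.

```python
import numpy as np

def task_loss(Phi, Psi):
    """L(Phi) = min_W ||Phi W - Psi||_F^2, with the inner minimum solved by least squares."""
    W, *_ = np.linalg.lstsq(Phi, Psi, rcond=None)
    return float(np.sum((Phi @ W - Psi) ** 2))

# Hypothetical auxiliary task matrix: n states, m tasks, d features.
rng = np.random.default_rng(0)
n, m, d = 20, 12, 4
Psi = rng.normal(size=(n, m))

# Features spanning the top-d left singular vectors of Psi.
U, s, _ = np.linalg.svd(Psi, full_matrices=False)
Phi_svd = U[:, :d]
# Random features of the same dimension, for comparison.
Phi_rand = rng.normal(size=(n, d))

loss_svd = task_loss(Phi_svd, Psi)
loss_rand = task_loss(Phi_rand, Psi)
# Energy in the discarded singular values; the best achievable rank-d loss.
tail = float(np.sum(s[d:] ** 2))
print(loss_svd, tail, loss_rand)
```

The loss of the singular-vector features matches the sum of the squared discarded singular values, which by the Eckart-Young theorem is the smallest loss any $d$-dimensional representation can reach; the random features cannot do better.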

4.1. EXTENSION TO THE RANDOM SUCCESSOR MEASURE

Let $\pi$ be a policy and $\Sigma$ the power set of $\mathcal{X}$. The successor measure $\psi^\pi : \mathcal{X} \times \mathcal{A} \times \Sigma \to \mathbb{R}$ extends the SR to quantify the discounted visitation frequency of an agent, in expectation over trajectories and for various subsets of the state space (Blier et al., 2021). Given a set $S \subset \mathcal{X}$, we write $\psi^\pi(x, a, S) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{P}\{X_t \in S \mid X_0 = x, A_0 = a, A_{t>0} \sim \pi\}$. As with the SR, this can be expressed in terms of an expectation over an indicator function, and further decomposed in a Bellman equation: $\psi^\pi(x, a, S) = \sum_{t=0}^{\infty} \mathbb{E}_\pi\left[\gamma^t \mathbb{1}\{X_t \in S\} \mid X_0 = x, A_0 = a, A_{t>0} \sim \pi\right] = \mathbb{1}\{x \in S\} + \gamma \, \mathbb{E}_\pi\left[\psi^\pi(X', A', S) \mid X = x, A = a\right]$. The passage from state equality to set inclusion is particularly appealing in deep RL: first, because states rarely repeat along a trajectory or between episodes, the indicator $\mathbb{1}\{x = y\}$ is almost always zero. Second, set inclusion allows us to incorporate a notion of closeness to $\psi^\pi$, e.g. by focusing on subsets $S$ that include semantically similar states. We will return to this point later in the section. By analogy with the tabular setting, let us now define a loss function which, if suitably minimized, should produce a useful state representation. For ease of exposition, we continue to assume that $\mathcal{X}$ is finite, although perhaps very large; the reader interested in a proper mathematical treatment of the full continuous-state setting is invited to consult Blier et al. (2021) and Pfau et al. (2019). Let $\xi$ be a distribution over subsets of states and let $\Xi \in \mathbb{R}^{n \times n}$ be a diagonal matrix with entries $\{\xi(x) : x \in \mathcal{X}\}$ on the diagonal. The Monte Carlo successor measure loss is $L_{MCSM}(\Phi) = \min_{w_{S,a} \in \mathbb{R}^d} \mathbb{E}_{S \sim \xi} \sum_{x \in \mathcal{X}, a \in \mathcal{A}} \left(\phi(x)^\top w_{S,a} - \psi^\pi(x, a, S)\right)^2$.
Theorem 1. If $\Phi^*$ is a feature matrix minimizing $L_{MCSM}(\Phi)$, then its column space spans the top $d$ left singular vectors of the (infinite-dimensional) successor measure matrix $\Psi^\pi$ with respect to the inner product $\langle x, y \rangle_\Xi = y^\top \Xi x$, for all $x, y \in \mathbb{R}^n$.
In practice, samples of $\psi^\pi(x, a, S)$ (which must be estimated from complete trajectories) are not available; instead, it is preferable to learn an approximation by bootstrapping (Sutton & Barto, 2018). The corresponding temporal-difference successor measure loss is $\min_{w_{S,a} \in \mathbb{R}^d} \mathbb{E}_{S \sim \xi} \sum_{x \in \mathcal{X}, a \in \mathcal{A}} \left(\mathbb{1}\{x \in S\} + \gamma \, \mathbb{E}_\pi\left[\phi(X')^\top w_{S,A'} \mid X = x, A = a\right] - \phi(x)^\top w_{S,a}\right)^2$; (1) we will use this form in the derivations that follow.

4.2. A PRACTICAL IMPLEMENTATION

Our algorithm aims to approximate the loss in Equation 1 using tools from deep RL. We first approximate the expectation over $\xi$ by sampling a collection of sets $(S_i)_{i=1}^m$ from $\xi$. These sets are kept fixed throughout learning. With this in mind, each set corresponds to an indicator function that we treat as a binary reward function $r_i(x) = \mathbb{1}\{x \in S_i\}$. The actual auxiliary task is then the value function of the random policy associated with this reward. Denote by $\hat{\psi}_i(x, a)$ the prediction made by our neural network for state $x$, action $a$, and the set $S_i$. Given a sample transition $(x, a, x')$, we define the sample target $r_i(x) + \gamma \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} \hat{\psi}_i(x', a')$. Notice that the average over the next action $a'$ arises as a consequence of taking the policy $\pi$ to be uniformly random. We then train the neural network by performing stochastic gradient descent on the loss derived from this sample target: $\left(r_i(x) + \gamma \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} \hat{\psi}_i(x', a') - \hat{\psi}_i(x, a)\right)^2$. Following common usage, the actual gradient estimate is obtained by aggregating multiple transitions into a minibatch and applying the Adam optimizer (Kingma & Ba, 2015). Before explaining how the sets $S_i$ are defined, let us remark on a number of appealing properties of these auxiliary tasks, when viewed from a deep RL perspective. First, the use of a random policy means that learning usually proceeds in an off-policy manner. However, we expect this to be a relatively mild form of off-policy learning, one that is in general much more stable than one derived by maximization, as in a Bellman optimality equation. Although one could also learn the value function associated with the current policy (as in SARSA (Rummery & Niranjan, 1994)), this precludes the use of offline datasets for learning the representation, or at least makes the learned representation strongly dependent on the behaviour policy. By contrast, the representation learned by PVN only depends on the availability of data.
In effect, these auxiliary tasks depend only on the structure of the environment, and not on the agent's behaviour. We also expect binary reward functions to be easier to tune than, say, those derived from a distance function (dependent on getting the scale parameter correct) or real-valued random rewards (dependent on the underlying distribution). Binary rewards are particularly appealing in domains where the reward function is itself binary or ternary (e.g., Atari 2600 video games), in which case they can be adjusted to have similar statistics to the true reward function. We will demonstrate how to do this in the following section.
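The sample target and squared loss described in this section can be sketched for a minibatch as follows. This is a NumPy illustration under stated assumptions rather than the authors' implementation: the function name `pvn_td_loss`, the array shapes, and the choice to sum over tasks and average over the batch are all hypothetical, and a real agent would compute the predictions with a neural network and differentiate the loss with respect to its parameters.

```python
import numpy as np

def pvn_td_loss(pred, next_pred, rewards, actions, gamma):
    """Squared TD error for the random-policy auxiliary tasks.

    pred:      (B, A, m) predictions psi_i(x, a) at the current states.
    next_pred: (B, A, m) predictions psi_i(x', a') at the next states.
    rewards:   (B, m) binary rewards r_i(x) = 1{x in S_i}.
    actions:   (B,) integer actions taken in the current states.
    """
    # Uniformly random target policy: average over next actions rather than taking a max.
    target = rewards + gamma * next_pred.mean(axis=1)    # (B, m)
    # Prediction for the action that was actually taken.
    chosen = pred[np.arange(len(actions)), actions]      # (B, m)
    # Sum over tasks, average over the minibatch (an assumed reduction).
    return float(np.mean(np.sum((target - chosen) ** 2, axis=-1)))

# One transition, two actions, one task.
pred = np.array([[[0.5], [1.0]]])
next_pred = np.array([[[2.0], [4.0]]])
rewards = np.array([[1.0]])
actions = np.array([1])
loss = pvn_td_loss(pred, next_pred, rewards, actions, gamma=0.5)
print(loss)  # target = 1 + 0.5 * 3 = 2.5, prediction = 1.0, loss = 1.5 ** 2 = 2.25
```

Replacing `next_pred.mean(axis=1)` with `next_pred.max(axis=1)` would instead give the Bellman optimality target that this section contrasts with the random-policy target.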

4.3. GENERATING INDICATOR FUNCTIONS

Thus far we have described our algorithm as sampling sets of states $(S_i)_{i=1}^m$, which are then converted into reward functions by means of an indicator. In deep RL, this is inconvenient for two reasons: first, because it is not clear from what distribution states should be sampled (how should one generate arbitrary video-game states?); second, because testing for set inclusion may also be brittle, effectively reducing to repeated equality tests. Instead, we opt here for an implied set, defined directly by its indicator function. Let $\mathcal{F}$ be a family of functions mapping $\mathcal{X}$ to $\{0, 1\}$. Then, for any function $f \in \mathcal{F}$, its implied set is $S_f = \{x \in \mathcal{X} : f(x) = 1\}$. Of course, this is equivalent to $f(x) = \mathbb{1}\{x \in S_f\}$. Sampling functions from $\mathcal{F}$ according to some distribution $\xi_f$ and using them in lieu of the indicator is therefore equivalent to sampling sets of states for some distribution $\xi$ implied by $\xi_f$. The advantage is that testing for inclusion in $S_f$ only requires the evaluation of $f$ at $x$, which for carefully-chosen functions can be done at little computational cost. The simplest scenario occurs when the family $\mathcal{F}$ is parametrized by some weight vector $\theta$, so that the random function $f_\theta$ corresponds to a random set of states. In this paper we consider two such families of functions: universal hash functions and random network indicators. Both families are tunable, in the sense that they are parametrized so that the implied sets $S_f$ each cover a desired fraction of the overall state space. In probabilistic terms, tunable means that we can with minimal or no computation find parameters such that for any given state $x$, $\mathbb{P}\{x \in S_f\} = p$. Here, the probabilistic statement is with respect to the draw of $f$ from $\mathcal{F}$.
For universal hash functions, the tuning is immediate from the algorithm, and so we describe it first. A Carter-Wegman family of hash functions $\mathcal{F}_{CW}$ (Carter & Wegman, 1979) consists of functions mapping each integer $x \in \mathbb{N}$ to the set $\{0, \ldots, k - 1\}$, with the property that $\mathbb{P}\{h(x) = i\} = \frac{1}{k}$ for $i = 0, \ldots, k - 1$, where the probabilistic statement is over the random draw of $h$ from $\mathcal{F}_{CW}$. One may think of a CW family as assigning labels to integers $x$ deterministically (in the sense that each $h$ is a deterministic function) but randomly (in the sense that $h$ itself is drawn at random). See Appendix D.1 for full implementation details. We construct our tunable indicator function as $f(x) = \mathbb{1}\{h(x) = 0\}$. By construction, choosing $k = \frac{1}{p}$ yields the desired tuning (up to integer rounding). In our setting, $x$ is a high-dimensional observation (for example, an image) rather than an integer; yet we will see that, perhaps surprisingly, encoding each image as a unique integer is sufficient to produce better-than-random state representations.
One drawback of using universal hash functions to define sets of interest is that they may assign different values to perceptually near-identical states (a single pixel difference suffices). Following common usage (Burda et al., 2019; Dabney et al., 2021), we may use randomly initialized neural networks to map similar states to similar values. Specifically, let us view a randomly initialized, single-output DQN network as a function $g : \mathcal{X} \to \mathbb{R}$. We further decompose this function into a map $g_1 : \mathcal{X} \to \mathbb{R}^l$ and a linear map from $\mathbb{R}^l$ to $\mathbb{R}$: $g(x) = g_1(x)^\top \omega + b$, where $\omega$ is a parameter vector and $b \in \mathbb{R}$ is a bias term. With this in mind, we may simply construct the indicator function $f(x) = \mathbb{1}\{g(x) \geq 0\}$. The result, however, is not yet tunable: it is hard to choose the right distribution of network weights so that a desired fraction of states satisfy $f(x) = 1$. However, for any $p \in [0, 1]$ and any non-zero fixed $\omega$, $g_1$, and distribution of states $\mu$, there exists a bias term $b$ such that $\mathbb{P}_{x \sim \mu}\{g_1(x)^\top \omega + b \geq 0\} = p$. Such a bias term can accurately be determined from a small number of online interactions using the method of quantile regression (Koenker, 2005); the exact update rule is given in Appendix D.2.
With this method, we obtain network-derived indicator functions that are tunable and are likely to assign similar values to perceptually similar states. We refer to this class of indicator functions as random network indicators (RNIs; Figure 1), which we empirically evaluate in the following section.
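Both tunable indicator families can be illustrated in a few lines of Python. The sketch below is an assumption-laden illustration, not the procedure of Appendix D: the hash uses the common form $h(x) = ((ax + b) \bmod P) \bmod k$ with a fixed prime $P$, the function names `sample_hash_indicator` and `tune_bias` are hypothetical, the Gaussian scores stand in for the pre-bias outputs $g_1(x)^\top \omega$ of a random network, and the bias update is a generic stochastic quantile-regression step that may differ from the exact rule used in the paper.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1; integer keys are assumed to be smaller

def sample_hash_indicator(p, rng):
    """Sample f(x) = 1{h(x) = 0}, with h drawn from a Carter-Wegman-style family."""
    k = max(1, round(1 / p))        # k = 1/p, up to integer rounding
    a = rng.randrange(1, PRIME)     # random multiplier
    b = rng.randrange(0, PRIME)     # random offset
    return lambda x: int(((a * x + b) % PRIME) % k == 0)

def tune_bias(scores, p, lr=0.01, steps=20_000, rng=None):
    """Find b such that roughly a fraction p of the scores satisfy score + b >= 0.

    Tracks the (1 - p)-quantile q of the scores with the stochastic update
    q += lr * ((1 - p) - 1{score < q}) and returns b = -q.
    """
    rng = rng or random.Random(0)
    q = 0.0
    for _ in range(steps):
        z = rng.choice(scores)
        q += lr * ((1 - p) - (1.0 if z < q else 0.0))
    return -q

rng = random.Random(0)
f = sample_hash_indicator(p=0.1, rng=rng)
print([f(x) for x in range(20)])

# Stand-in for the pre-bias outputs of a random network on observed states.
scores = [rng.gauss(0.0, 1.0) for _ in range(2_000)]
b = tune_bias(scores, p=0.2, rng=rng)
fraction = sum(z + b >= 0 for z in scores) / len(scores)
print(round(fraction, 2))  # should be close to the requested p = 0.2
```

The hash indicator needs no tuning beyond the choice of $k$, whereas the network indicator adjusts only the bias term and leaves the random network untouched, which is what makes the second family tunable at low cost.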

5. EMPIRICAL ANALYSIS

To disentangle the contributions of the primary and auxiliary tasks on the expressiveness of the learned features, we split our learning procedure in two parts: a representation pre-training phase, and an online RL phase. During the representation pre-training phase, we use transition data from offline Atari datasets in RL Unplugged (Agarwal et al., 2020; Gulcehre et al., 2020) and the procedure described in Section 4 to train an encoder which acts as a feature extractor (see Appendix D for complete implementation details). Note that while this dataset contains environment rewards, none of the methods make use of the environment rewards unless explicitly stated. Following the pre-training phase, we fix the weights of the learned encoder and train an RL agent online directly from this "frozen" representation. Notably, we train for only 3.75 million agent steps, compared to the 50 million agent steps (200M Atari 2600 frames) that is standard in most Atari setups. Our agents are implemented using the Acme library (Hoffman et al., 2020) . Our hyperparameter choices for both phases of training can be found in Appendix D.5.

5.1. SCALING CAPACITY WITH AUXILIARY TASKS

Prior work indicates that the optimal number of auxiliary tasks for representation learning is unexpectedly small, and that scaling up the number of auxiliary tasks can hurt performance (Lyle et al., 2021, Fig. 5). We expect that the representational capacity of the neural network has a strong effect on the number of auxiliary tasks we are able to learn with. To study this effect, we use the Impala-CNN network (Espeholt et al., 2018) and vary its effective width; that is, we multiply the number of convolutional filters and the number of features in the penultimate layer. We select a width multiplier in the set {1, 2, 4, 8} and sweep the number of tasks from {0, 10, . . . , 100}. For this experiment, we use 5 games (ASTERIX, BEAM RIDER, PONG, QBERT, and SPACE INVADERS) with 3 seeds for pre-training, resulting in 15 encoders for each combination of width multiplier and number of auxiliary tasks. During the online phase, we train with 3 seeds per encoder, resulting in a total of 45 runs per sweep configuration. We evaluate for 100 episodes after 1M agent steps. We summarize our results using Rliable (Agarwal et al., 2021b). Figure 3 depicts the optimality gap (distance from human-level performance). We find that increasing the representational capacity of the network increases performance, even for a very small number of tasks. This is perhaps surprising, since it indicates that we only need a handful of tasks to train large-scale representations, corroborating results by Lyle et al. (2021). Though a small part of this performance gain might be obtained just by virtue of having more output features (following the lottery ticket hypothesis), we can see that, for all network sizes, there is a marked improvement when we increase the number of auxiliary tasks from 0. We further find that as network capacity is increased, the algorithm can use more auxiliary tasks to improve its representation.
For example, while the 2× network achieves maximal performance with 10 tasks, the 8× network performs best in the range of [50, 100] tasks. This gives evidence for the scalability of PVN as an approach for learning rich state representations.

5.2. EVALUATING THE LEARNED REPRESENTATION

Using the insights gained from our scaling experiment, we evaluate a model with a large number of auxiliary tasks on a broader suite of Atari games. We use the 8× network and fix the number of auxiliary tasks to 100, which empirically performed well. We use the same training setup described in the previous section, though we use all 46 games available in RL Unplugged. We use 3 seeds for offline pre-training, and 3 additional seeds per encoder during online training. We train for 3.75M agent steps, and evaluate for 100 episodes. We compare against the following pre-training baselines:
Random Initialization: Randomly initialized features using the same network architecture. This simple baseline should confirm that the efficacy of our representations comes from our pre-training procedure, and not merely from our use of a large encoder network.
Random Cumulants (RCs): Random reward functions introduced by Dabney et al. (2021), and later expanded upon by Lyle et al. (2021). This method is similar to ours, but uses a random reward $r_i(x, x') = s \cdot (f(x') - f(x))$ instead of the random indicator function, and replaces the average over next-state actions by a maximization (off-policy learning of the optimal policy for each cumulant). Here, $f$ is also given by a random network.
Self-Predictive Representations (SPR): A contrastive-learning method that directly optimizes for temporal consistency of the learned representation (Schwarzer et al., 2021a). It does so by learning a latent-space transition model and forcing subsequent states to have similar representations.
Behavior Cloning (BC): Behavior cloning has been shown to be a strong baseline in offline RL, especially when increasing the amount of pre-training data (Schwarzer et al., 2021b; Baker et al., 2022). It should give a strong indication of the performance that is possible when using large datasets.
For each of these methods, we freeze the 8× encoder after the pre-training stage and use the previously described online training scheme. Figure 4 illustrates that PVN outperforms these baselines in all aggregate metrics. We also note that PVN using linear function approximation (3.75M agent interactions) is competitive with DQN (50M agent interactions) in many games, as illustrated in the per-game results found in Appendix F. We visualize the learned representations from different methods using multidimensional scaling (MDS) plots in Figure 2 (with more games in Appendix E). These plots show that different methods clearly lead to representations with different structures. Notably, the representations learned by PVN (RNI) place temporally-successive states close together, and appear to capture information about the dynamics of the environment without requiring access to the environment reward.

5.3. ABLATIONS

We perform ablative experiments to verify the importance of the different PVN components. First, we validate our choice of indicator function by replacing RNIs with the hash indicator functions described in Section 4. We compare their performance in Figure 5, which shows that hash indicator functions perform poorly compared to RNIs; this indicates that the choice of indicator function is an important design decision. We expect that the inductive biases in random convolutional networks allow RNIs to include a notion of state similarity in the tasks they induce. Next, we hypothesize that using the random policy as the target policy is a key contributor to PVN's performance. To verify this hypothesis, we ablate the TD-target of our learning update to maximize over the next-state action-values, as per the Bellman optimality equation. When the mean function is replaced with the max function in the TD backup, PVN attempts to learn the optimal value function for each indicator function, rather than the value function of the random policy. The result of this experiment can be seen in Figure 5. The mean formulation achieves a much higher median human-normalized score than the max formulation. This is likely due to instability that arises from maximization bias and limited state coverage under the off-policy learning required for the optimal value function. Learning the value function of the random policy also requires off-policy learning; however, we expect its effect to be much milder, as previously described in Section 4.2.

6. DISCUSSION

While our experiments have shed some light on the scalability of auxiliary tasks, there are a number of remaining open questions that represent exciting opportunities for further exploration. One future direction is to use insights from the literature on scaling models effectively (Tan & Le, 2019) to further scale the auxiliary tasks we introduced here. Orthogonally, it may still be possible to train with more tasks without increasing the capacity of the network. It is surprising that even a relatively large network with tens of millions of parameters, such as Impala (8×), only supports a handful of tasks. It is not clear why training with more tasks leads to worse performance, especially for smaller Impala architectures. Finally, in line with Agarwal et al. (2022), we have open-sourced our pre-trained representations (see Appendix D), which we hope will enable researchers to tackle credit assignment on the ALE without needing to re-learn such representations.

A BACKGROUND

A.1 PROTO-VALUE FUNCTIONS

Proto-Value Functions (PVFs) are defined in terms of the graph Laplacian $L \in \mathbb{R}^{n \times n}$, that is, $L = D - A$, where $D \in \mathbb{R}^{n \times n}$ is the degree matrix and $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix. The PVFs themselves are the eigenvectors of the graph Laplacian, that is, the nonzero vectors $v \in \mathbb{R}^n \setminus \{0\}$ satisfying

$$L v = \lambda v,$$

where $\lambda \in \mathbb{R}$ is the eigenvalue associated with the eigenvector $v$. Individually, these eigenvectors correspond to different time-scales of the diffusion process of a random walk over the state space (Mahadevan & Maggioni, 2007). Intuitively, PVFs can be thought of as capturing large-scale temporal properties of the environment. Figure 6 shows an example of the first four PVFs on the Four-Room domain (Sutton et al., 1999; Solway et al., 2014) to give some intuition for their structure.
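As a minimal sketch of this definition, the snippet below builds the Laplacian of a small open grid and takes its eigenvectors. The 4×4 grid (standing in for the Four-Room domain), the helper `grid_adjacency`, and the choice to keep the four eigenvectors with the smallest eigenvalues are assumptions made for illustration.

```python
import numpy as np

def grid_adjacency(rows, cols):
    """Adjacency matrix of a 4-connected grid without walls (illustrative)."""
    n = rows * cols
    A = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:  # edge to the right neighbour
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < rows:  # edge to the neighbour below
                A[i, i + cols] = A[i + cols, i] = 1.0
    return A

A = grid_adjacency(4, 4)
D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # graph Laplacian

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Keep the four smoothest eigenvectors (smallest eigenvalues) as PVFs.
pvfs = eigenvectors[:, :4]
```

On a connected graph the first eigenvector is constant with eigenvalue zero; the following ones vary progressively faster across the grid.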

A.2 THE SUCCESSOR REPRESENTATION

Let $P^\pi \in \mathbb{R}^{n \times n}$ be the transition matrix and $r^\pi$ the reward vector, both induced by the policy $\pi$. We can now write the policy evaluation equation for the values $v^\pi \in \mathbb{R}^n$ as

$$v^\pi = \underbrace{(I - \gamma P^\pi)^{-1}}_{\Psi^\pi} \, r^\pi,$$

where $\Psi^\pi$ is the Successor Representation (SR). We can also write each element of the SR as the expected discounted future occupancy of a state $s'$ given that we start in a state $s$:

$$\Psi^\pi(s, s') = \sum_{t \ge 0} \gamma^t \, \mathbb{P}(S_t = s' \mid S_0 = s) = \mathbb{E}_\pi\Big[\sum_{t \ge 0} \gamma^t \, \mathbb{1}\{S_t = s'\} \,\Big|\, S_0 = s\Big].$$
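The two views of the SR (matrix inverse and discounted occupancy sum) can be checked numerically. This is a sketch on a randomly generated transition matrix; the sizes, seed, and variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# Random row-stochastic transition matrix and reward vector (toy data).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)

# Closed form: SR = (I - gamma P)^{-1}, and values v = SR r.
sr = np.linalg.inv(np.eye(n) - gamma * P)
v = sr @ r

# Occupancy view: sum over t >= 0 of gamma^t P^t, truncated at 500 terms.
sr_series = sum(gamma**t * np.linalg.matrix_power(P, t) for t in range(500))
```

Each row of the SR sums to $1 / (1 - \gamma)$, since every $P^t$ is row-stochastic.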

A.3 CONNECTION BETWEEN THE SR & PVFS

We can further connect the Successor Representation with Proto-Value Functions under some assumptions. Assumption 1. The Successor Representation is defined with respect to the uniform random policy. Assumption 2. The transition matrix P π is symmetric. Under the above assumptions, we have that the eigenvectors of Ψ π are equivalent to the PVFs (eigenvectors of L) (Machado et al., 2017; Stachenfeld et al., 2014) . This helps motivate the choice of the uniform random policy as the target policy in the PVN TD update.
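A small numerical check of this connection, under both assumptions, is sketched below. The ring graph is an assumption chosen because every state has the same degree, which makes the uniform random walk symmetric; on such a graph $P = I - L/2$, so each Laplacian eigenvalue $\lambda$ maps to the SR eigenvalue $1 / (1 - \gamma(1 - \lambda/2))$.

```python
import numpy as np

n, gamma = 8, 0.9

# Ring graph: every state has exactly two neighbours.
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

D = np.diag(A.sum(axis=1))
L = D - A

# Uniform random policy: move to either neighbour with probability 1/2.
P = A / A.sum(axis=1, keepdims=True)
sr = np.linalg.inv(np.eye(n) - gamma * P)

# Eigenvectors of the Laplacian (the PVFs) and the SR eigenvalues they imply.
lam, V = np.linalg.eigh(L)
mu = 1.0 / (1.0 - gamma * (1.0 - lam / 2.0))
```

The ordering is reversed between the two matrices: the smallest Laplacian eigenvalue corresponds to the largest SR eigenvalue.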

B PROOFS FOR SECTION 4

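Theorem 1. If $\Phi^*$ is a feature matrix minimizing $L_{\mathrm{MCSM}}(\Phi)$, then its column space spans the top $d$ left singular vectors of the (infinite-dimensional) successor measure matrix $\Psi^\pi$ with respect to the inner product $\langle x, y \rangle_\Xi = y^\top \Xi x$, for all $x, y \in \mathbb{R}^n$.

Proof. We consider the SVD of the successor measure $\Psi$ with respect to the inner product weighted by $\Xi$. In matrix form, we write $\Psi = F \Sigma B^\top$ where $F \in \mathbb{R}^{n \times d}$, $\Sigma \in \mathbb{R}^{d \times d}$ and $B \in \mathbb{R}^{n \times d}$ satisfy $F^\top F = I$, $B^\top \Xi B = I$, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$, and $\sigma_i$ are the singular values of $\Psi$ sorted in decreasing order. Then

$$
\begin{aligned}
\arg\min_{\Phi \in \mathbb{R}^{n \times d}} L_{\mathrm{MCSM}}(\Phi)
&= \arg\min_{\Phi \in \mathbb{R}^{n \times d}} \; \min_{w_{S,a} \in \mathbb{R}^d} \; \mathbb{E}_{\substack{S \sim \xi \\ x \in \mathcal{X},\, a \in \mathcal{A}}} \Big[ \big( \phi(x)^\top w_{S,a} - \psi(x, a, S) \big)^2 \Big] \\
&= \arg\min_{\Phi \in \mathbb{R}^{n \times d}} \; \min_{W} \; \| \Phi W - \Psi \|_\Xi^2 \\
&= \arg\min_{\Phi \in \mathbb{R}^{n \times d}} \; \| \Pi_\Phi^\perp \Psi \|_\Xi^2 ,
\end{aligned}
$$

where $\Pi_\Phi$ is the orthogonal projection onto $\mathrm{span}(\Phi)$ and $\Pi_\Phi^\perp$ is the projection onto its orthogonal complement. The above is equivalent to saying that $\Phi$ must span the top $d$ singular vectors of $\Psi$.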
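The last step of the proof can be illustrated numerically in a simplified setting. The sketch below assumes a finite matrix and takes $\Xi = I$ (an unweighted inner product), so it does not cover the weighted, infinite-dimensional case of the theorem; the matrix `Psi` is random toy data and `residual` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2
Psi = rng.random((n, n))  # toy stand-in for the successor measure matrix

# Top-d left singular vectors as the candidate optimal features.
U, S, Vt = np.linalg.svd(Psi)
Phi_star = U[:, :d]

def residual(Phi, Psi):
    """Squared Frobenius norm of the part of Psi orthogonal to span(Phi)."""
    Q, _ = np.linalg.qr(Phi)  # orthonormal basis of span(Phi)
    return np.linalg.norm(Psi - Q @ (Q.T @ Psi)) ** 2

best = residual(Phi_star, Psi)

# Residuals of randomly drawn d-dimensional subspaces, for comparison.
random_residuals = [
    residual(rng.standard_normal((n, d)), Psi) for _ in range(100)
]
```

In this unweighted case the optimal residual equals the sum of the discarded squared singular values, and no random subspace does better.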

C TABULAR RESULTS

Define the Successor Representation (SR) as $\Psi^\pi = (I - \gamma P^\pi)^{-1} \in \mathbb{R}^{n \times n}$ and assume that $P^\pi$ is symmetric. Let $G_k \in \mathbb{R}^{n \times \binom{n}{k}}$ be the matrix containing all the binary vectors corresponding to all $\binom{n}{k}$ subsets of size $k$ (i.e., its columns are all possible $k$-hot binary vectors). For example, given $n = 4$ we have

$$
G_2 = \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 & 1
\end{pmatrix}.
$$

In the tabular setting we seek to learn the successor measure with respect to $G_k$ by minimizing

$$L(\Phi, W) = \| \Phi W - \Psi G_k \|_F^2 .$$

We know that the optimal $\Phi$ will span the principal components of $\Psi G_k$ (Bellemare et al., 2019). Note that when $k = 1$ we have $\Psi G_1 = \Psi$, in which case the principal components are the PVFs (Machado et al., 2017). We want to characterize the principal subspace of $\Psi G_k$ for $1 < k < n$.

Claim: $C = (\Psi G_k)(\Psi G_k)^\top$ has the same eigenvectors for all $k \in \{1, \ldots, n-1\}$.

Proof: We start by writing down the covariance matrix as

$$C = (\Psi G_k)(\Psi G_k)^\top = \Psi G_k G_k^\top \Psi^\top .$$

The matrix $G_k G_k^\top$ is a double-constant matrix (O'Neill, 2021), i.e., it has a constant $a$ on the diagonal and a constant $t$ different from $a$ on the off-diagonal:

$$
M_k = G_k G_k^\top = \begin{pmatrix}
a & t & t & \cdots & t \\
t & a & t & \cdots & t \\
t & t & a & \cdots & t \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t & t & t & \cdots & a
\end{pmatrix}.
$$

In our case we have $a = \binom{n-1}{k-1}$ and $t = \binom{n-2}{k-2}$. Furthermore, by another property of double-constant matrices, the eigenvalues of $M_k$ are $\lambda_1 = \lambda^{**} = a - t + n \cdot t$ and $\lambda_i = \lambda^{*} = a - t$ for all $i = 2, \ldots, n$. The eigenvectors for $\lambda^{**}$ are $v^{**} \propto \mathbf{1}$, where $\mathbf{1}$ is the vector of all ones. The eigenvectors for $\lambda^{*}$ are any nonzero vectors $v^{*}$ with $v^{*} \cdot \mathbf{1} = 0$, i.e., $v^{*}$ is orthogonal to the vector of all ones.

Next, we characterize the eigenspace of the matrix $\Psi^\pi$. Since $P^\pi$ is symmetric, hence diagonalizable as $P^\pi = Q \Lambda Q^{-1}$, we have

$$\Psi = (I - \gamma P^\pi)^{-1} = (I - \gamma Q \Lambda Q^{-1})^{-1} = Q (I - \gamma \Lambda)^{-1} Q^{-1} .$$

This means that the eigenvectors of $\Psi^\pi$ are the same as the eigenvectors of $P^\pi$.

We will denote the eigenvalues of $P^\pi$ by $\lambda_i$ with associated eigenvectors $x_i$. For simplicity, we denote the eigenvalues of $\Psi^\pi$ by $\mu_i$ for $i = 1, \ldots, n$. Note that $\mu_i = (1 - \gamma \lambda_i)^{-1}$ for $i = 1, \ldots, n$. Furthermore, since $P^\pi$ is a stochastic matrix, $\mathbf{1}$ is an eigenvector with eigenvalue 1. We let $x_1 = \mathbf{1}$ without loss of generality. Also, since $P^\pi$ is assumed to be symmetric, the eigenvectors can be chosen to be orthogonal to each other.

Putting this all together, take $x_i$ to be the $i$-th eigenvector of $\Psi^\pi$ (and $P^\pi$). We now have

$$
\begin{aligned}
C x_i &= \Psi M_k \Psi^\top x_i \\
&= \Psi M_k \Psi x_i && \text{(by symmetry)} \\
&= \Psi M_k \mu_i x_i . && \text{($x_i$ is an eigenvector of $\Psi$)}
\end{aligned}
$$

Now there are two cases.

Case 1: If $x_i = \mathbf{1}$ (and $i = 1$) we have

$$C x_i = \Psi M_k \mu_i x_i = \Psi \lambda^{**} \mu_1 \mathbf{1} = \mu_1 \lambda^{**} \mu_1 \mathbf{1} = \mu_1^2 \lambda^{**} \mathbf{1} .$$

Case 2: If $x_i \neq \mathbf{1}$ (and $i > 1$) we know that $x_i$ is orthogonal to $\mathbf{1}$ (since $P^\pi$ is a symmetric matrix) and thus lies in the second eigenspace of $M_k$, corresponding to the eigenvalue $\lambda^{*}$. Therefore, we have

$$C x_i = \Psi M_k \mu_i x_i = \Psi \lambda^{*} \mu_i x_i = \mu_i \lambda^{*} \mu_i x_i = \mu_i^2 \lambda^{*} x_i .$$

Thus, we have shown that $C = \Psi M_k \Psi$ has the same eigenvectors as $\Psi$, and that these eigenvectors are independent of $k$. The new eigenvalues are $\mu_1^2 \lambda^{**}$ for the eigenvector $\mathbf{1}$ and $\mu_i^2 \lambda^{*}$ for all other eigenvectors $x_i$, $i = 2, \ldots, n$.
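The claim can be checked numerically on a small example. The sketch below is an illustration under assumptions: it uses the uniform random walk on a ring of six states as a symmetric $P^\pi$, and the helper name `k_hot_matrix` is hypothetical. For every $k$ it verifies that each eigenvector of $P^\pi$ is an eigenvector of $C$ with the eigenvalue derived above.

```python
import itertools
import math
import numpy as np

def k_hot_matrix(n, k):
    """Matrix whose columns are all k-hot binary vectors of length n."""
    subsets = list(itertools.combinations(range(n), k))
    G = np.zeros((n, len(subsets)))
    for col, subset in enumerate(subsets):
        G[list(subset), col] = 1.0
    return G

n, gamma = 6, 0.9

# Symmetric stochastic matrix: uniform random walk on a ring (assumption).
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
P = A / 2.0
sr = np.linalg.inv(np.eye(n) - gamma * P)

# Eigenpairs of P, and the SR eigenvalues mu_i = 1 / (1 - gamma * lambda_i).
p_eigvals, X = np.linalg.eigh(P)
mu = 1.0 / (1.0 - gamma * p_eigvals)
is_ones = np.isclose(p_eigvals, 1.0)  # marks the all-ones eigenvector

checks = {}
for k in range(1, n):
    G = k_hot_matrix(n, k)
    C = (sr @ G) @ (sr @ G).T
    a = math.comb(n - 1, k - 1)
    t = math.comb(n - 2, k - 2) if k >= 2 else 0
    # Eigenvalue of M_k: a - t + n*t on the all-ones vector, a - t otherwise.
    m_eig = np.where(is_ones, a - t + n * t, a - t)
    checks[k] = np.allclose(C @ X, X * (mu**2 * m_eig))
```

Every entry of `checks` should be `True`, confirming that only the eigenvalues of $C$, not its eigenvectors, depend on $k$.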

D IMPLEMENTATION DETAILS

We have released a reference implementation along with notebooks demonstrating how to download and use our pre-trained representations at: https://github.com/google-research/google-research/tree/master/pvn.

D.1 UNIVERSAL HASH FUNCTIONS

We define the set of multiply-shift universal hash functions (Carter & Wegman, 1979) as:

$$h_i(x) = \Big( \Big( a^{(i)}_0 + \sum_{j=1}^{n} a^{(i)}_j \cdot x_j \Big) \bmod p \Big) \bmod m ,$$

where $x \in \mathbb{R}^n$ is a flattened vector of the environment's observation, $a^{(i)} \in \mathbb{R}^n$ is a randomly initialized vector that parameterizes the hash function, $p$ is a prime, which in our case is the Mersenne prime $p = 2^{13} - 1$, and $m$ allows us to control the activation proportion of the indicator function. We can now define the indicator function as follows:

$$f_i(x) = \mathbb{1}\{ h_i(x) = 0 \} .$$

D.2 QUANTILE REGRESSION

We use quantile regression to tune the proportion of states that trigger our random network indicator functions. To do so, we use a tunable bias that we update with gradient descent. First, recall that the random network indicators are computed using a random neural network $f : x \mapsto \mathbb{R}$. If we naively apply the SIGN function to the network output, the proportion of states that map to 1 is unlikely to match the target proportion $p$. Therefore, we first add a bias term to the output, $\tilde{r}(x) = f(x) + b$, and then tune the bias to minimize the quantile regression loss

$$L_{\mathrm{QR}}(b) = \mathbb{E}_{x \in \mathcal{X}} \Big[ \tilde{r}(x) \cdot \big( (1 - p) - \mathrm{SIGN}(\tilde{r}(x)) \big) \Big] . \tag{2}$$

Once the bias has been tuned, the output of the random network indicator is $r(x) = \mathrm{SIGN}(f(x) + b)$.
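The two indicator constructions in D.1 and D.2 can be sketched as follows. Several things here are assumptions for illustration: observations are taken to be integer (uint8) vectors, Gaussian samples stand in for the outputs of a random network, and all function names are hypothetical. The bias is tuned with the standard pinball (quantile) loss $\rho_\tau(u) = u \, (\tau - \mathbb{1}\{u < 0\})$ at level $\tau = 1 - p$, whose sign convention may differ from Equation (2).

```python
import numpy as np

P_PRIME = 2**13 - 1  # the Mersenne prime used in D.1

def make_hash_indicator(dim, m, rng):
    """Indicator 1{h(x) = 0} for a randomly drawn hash function."""
    a = rng.integers(1, P_PRIME, size=dim + 1)  # a[0] plays the role of a_0

    def indicator(x):
        total = int(a[0]) + int(np.dot(a[1:], x.astype(np.int64)))
        return int((total % P_PRIME) % m == 0)

    return indicator

def pinball_grad_wrt_bias(pre_activation, p):
    """Gradient w.r.t. b of the mean pinball loss at level tau = 1 - p."""
    tau = 1.0 - p
    return (tau * np.mean(pre_activation > 0)
            - (1.0 - tau) * np.mean(pre_activation < 0))

def tune_bias(outputs, p, lr=0.5, steps=500):
    """Full-batch gradient descent on the bias only."""
    b = 0.0
    for _ in range(steps):
        b -= lr * pinball_grad_wrt_bias(outputs + b, p)
    return b

rng = np.random.default_rng(0)

# Hash indicator: fires on roughly 1/m of random observations.
hash_indicator = make_hash_indicator(dim=16, m=10, rng=rng)
observations = rng.integers(0, 256, size=(5000, 16), dtype=np.uint8)
hash_rate = np.mean([hash_indicator(x) for x in observations])

# Bias tuning: Gaussian samples stand in for random network outputs f(x).
p = 0.1
outputs = rng.standard_normal(10_000)
bias = tune_bias(outputs, p)
firing_rate = np.mean(outputs + bias > 0)
```

At a stationary point of this loss the fraction of positive pre-activations equals $p$, which is why the tuned indicator fires on roughly the target proportion of states.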

D.3 ALGORITHM

Algorithm 1 gives pseudo-code for the method as implemented with a fixed replay memory.

    $L_{\mathrm{PVN}}(\theta) \leftarrow \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} \Big( r_j(x_i) + \gamma \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} \Psi^{(j)}_{\theta^-}(x'_i, a') - \Psi^{(j)}_{\theta}(x_i, a_i) \Big)^2$

    # Calculate quantile regression loss
    $L_{\mathrm{QR}}(b_j) \leftarrow \frac{1}{n} \sum_{i=1}^{n} \tilde{r}_j(x_i) \cdot \big( (1 - p) - \mathrm{SIGN}(\tilde{r}_j(x_i)) \big)$

    # Perform gradient step
    Update $\theta \leftarrow \theta - \eta_1 \frac{\partial}{\partial \theta} L_{\mathrm{PVN}}(\theta)$
    Update $b_j \leftarrow b_j - \eta_2 \frac{d}{d b_j} L_{\mathrm{QR}}(b_j) \quad \forall j = 1, \ldots, m$

    $\theta^- \leftarrow \tau \theta^- + (1 - \tau) \theta$
end for

D.4 SELF-PREDICTIVE REPRESENTATIONS (SPR)

We implement an 8x version of SPR (Schwarzer et al., 2021a) using the same parameters as in Schwarzer et al. (2021a), except that we take the final fixed representation to be the projection layer in addition to the convolutional encoder. This was done to maintain the number of features for all our pre-trained methods. We also train SPR for much longer than in the original paper; specifically, we perform the same number of gradient steps as PVN.
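The core of the update can be sketched in NumPy. This is a simplified illustration, not the released implementation: it assumes a squared TD error, takes network outputs as precomputed arrays, treats parameters as a single array, and uses hypothetical function names.

```python
import numpy as np

def pvn_loss(psi_online, psi_target_next, actions, rewards, gamma=0.99):
    """Mean squared TD error over a batch and over all indicator tasks.

    psi_online:      (n, m, |A|) online-network predictions at x_i
    psi_target_next: (n, m, |A|) target-network predictions at x'_i
    actions:         (n,) integer actions a_i taken in x_i
    rewards:         (n, m) indicator rewards r_j(x_i)
    """
    n = psi_online.shape[0]
    # Predictions for the actions actually taken, shape (n, m).
    predictions = psi_online[np.arange(n), :, actions]
    # Uniform random target policy: average over next-state actions.
    bootstrap = psi_target_next.mean(axis=2)
    td_error = rewards + gamma * bootstrap - predictions
    return np.mean(td_error**2)

def polyak_update(target_params, online_params, tau):
    """Target-network update: theta^- <- tau * theta^- + (1 - tau) * theta."""
    return tau * target_params + (1.0 - tau) * online_params

# Toy batch: n = 2 transitions, m = 3 tasks, |A| = 2 actions.
example_loss = pvn_loss(
    psi_online=np.zeros((2, 3, 2)),
    psi_target_next=np.ones((2, 3, 2)),
    actions=np.array([0, 1]),
    rewards=np.zeros((2, 3)),
    gamma=0.5,
)
example_target = polyak_update(np.array([1.0]), np.array([0.0]), tau=0.9)
```

In a full training loop the gradient of this loss with respect to the online parameters would be taken with an automatic differentiation library, while the target parameters are only changed through the Polyak update.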

D.5 HYPERPARAMETERS

In the tables below we report all relevant hyperparameter choices for both our offline pre-training phase, and online learning phase. We selected most of our hyperparameters based on best practices from previous work. We chose p based on the estimated reward proportion from actual Atari games. We tuned our online hyperparameters using 5 tuning games, ASTERIX, BEAM RIDER, PONG, QBERT, and SPACE INVADERS. 

E MDS PLOTS

Below is a selection of MDS plots for the methods discussed in the paper for each of the 5 tuning games. These plots are generated using the representations learned during the pre-training phase, and one expert trajectory is presented in each plot. Darker points correspond to states at the beginning of the trajectory, and lighter points correspond to states at the end of the trajectory. These plots demonstrate that the representations learned by each method are clearly different, and therefore have different properties. With these MDS explorations, we hope to gain some insight into the properties of each learned representation. Motivated by PVFs, we expect a good (general) representation to capture the structure of the underlying transition dynamics of the environment. We note that PVN captures the temporal structure of each episode relatively well. With PVN, states that are close together in time have similar features, aligning with the properties of PVFs in the tabular case.

G TRAINING CURVES

Figure 8 presents training curves for all 46 games in RL Unplugged. Each point corresponds to the average episodic return during training binned over 1M frames. The shaded region corresponds to the 95% bootstrapped confidence interval for the mean over three runs. Note: These results will differ from Table 3, as we use a separate evaluation phase with a lower value of ε, as is standard in the ALE.



Behzadian & Petrik (2018) gives the singular-vector extension for the asymmetric case. Because this extension is straightforward and symmetry rarely holds, in this paper we use the term proto-value networks to describe state representations learned in both the symmetric and asymmetric settings.

This corresponds to using a SIGN nonlinearity at the end of the network.



Figure 1: (a) State equality indicator implemented as a one-hot encoding over X , as in the Successor Representation, while (b) Random Network Indicators parameterize the sets S i in PVN. Each panel shows a grid world with a reward function r(x) derived from the associated indicator. The transition arrow is the process of learning the value function from the preceding reward function.

Figure 4: Performance of PVN RNI vs other methods described in Section 5.2. Computed using 125 seeds, aggregated across 46 Atari games.

[Plot: Median panel with series Random Initialization, PVN (Hash), PVN (RNI + Opt. Policy), and PVN (RNI).]

Figure 5: PVN (Hash): PVN with hash indicator functions. PVN (RNI): PVN with random network indicators. A randomly initialized network with (8×) capacity is plotted for comparison.

Figure 6: First four proto-value functions (eigenvectors of $\Psi^\pi$ when $\pi$ is the uniform random policy) on the Four-Room grid world.

Algorithm 1: Proto-Value Networks

Require: Transition dataset $D$, function approximator $\Psi_\theta : \mathcal{X} \to \mathbb{R}^{m \times |\mathcal{A}|}$, $m$ RNI networks $f_i : \mathcal{X} \to \mathbb{R}$, $m$ RNI threshold bias vectors $b_i$, Polyak coefficient $\tau$, reward proportion $p$

for step = 1, . . . do
    Sample mini-batch of $n$ transitions $\{(x, a, x')\}$
    ...
    $\tilde{r}_j(x) \leftarrow f_j(x) + b_j \quad \forall j = 1, \ldots, m$
    $r_j(x) \leftarrow \mathrm{SIGN}(\tilde{r}_j(x))$

Figure 7: MDS plot for a single trajectory.

Figure 8: Training curves for: PVN (RNI), SPR, and Random Cumulants on all 46 games in RL Unplugged. The shaded region corresponds to the 95% bootstrapped confidence interval for the mean over three runs. The dashed horizontal line corresponds to the average evaluation score for Behavioral Cloning over three runs.


PVN Hyperparameters

ACKNOWLEDGEMENTS

We thank Nathan U. Rahn, Max Schwarzer, Harley Wiltzer, Wesley Chung, Adrien Ali Taïga, Pierluca D'Oro, David Meger, and Doina Precup for their useful feedback on this work. A special thanks to Wesley Chung for completing the tabular proof presented in Appendix C. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada CIFAR AI Chair program. We also acknowledge the Python community whose contributions made this work possible. In particular, this work made extensive use of Jax (Bradbury et al., 2018), Flax (Heek et al., 2023), Optax (Babuschkin et al., 2020), NumPy (Harris et al., 2020), Pandas (Wes McKinney, 2010), Matplotlib (Hunter, 2007), and Seaborn (Waskom, 2021).

F PER-GAME RESULTS

Below, we report the per-game results for the methods discussed in the paper. In addition, we include DQN and the Environment Reward method, which trains an encoder using the environment reward during the pre-training phase, and then uses the fixed representation to train a linear head in the same manner as the compared methods. Note that Environment Reward acts as a kind of oracle in our setting -it is the only method that has access to the environment reward during the pre-training phase. The results reported here use 1 offline seed and 3 online seeds, and evaluation scores (averaged over 100 evaluation runs) are reported after 3.75M agent steps. DQN's performance is reported after 50M agent steps. 

