GRAPH BACKUP: DATA EFFICIENT BACKUP EXPLOITING MARKOVIAN TRANSITIONS

Abstract

The successes of deep Reinforcement Learning (RL) are limited to settings where we have a large stream of online experiences, but applying RL in the data-efficient setting with limited access to online interactions is still challenging. A key to data-efficient RL is good value estimation, but current methods in this space fail to fully utilise the structure of the trajectory data gathered from the environment. In this paper, we treat the transition data of the MDP as a graph, and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation. Compared to multi-step backup methods such as n-step Q-Learning and TD(λ), Graph Backup can perform counterfactual credit assignment and gives stable value estimates for a state regardless of which trajectory the state is sampled from. Our method, when combined with popular off-policy value-based methods, provides improved performance over one-step and multi-step methods on a suite of data-efficient RL benchmarks including MiniGrid, Minatar and Atari100K. We further analyse the reasons for this performance boost through a novel visualisation of the transition graphs of Atari games.



We propose a specific implementation of Graph Backup, extending Tree Backup (Precup et al., 2000) (Section 4, see Figure 7 (c)). Combined with DQN (Mnih et al., 2015) and Data-Efficient Rainbow (van Hasselt et al., 2019), Graph Backup improves data efficiency and final performance on MiniGrid (Chevalier-Boisvert et al., 2018), MinAtar (Young & Tian, 2019) and Atari100K compared to other backup methods, showing that utilising the graph structure of the trajectory data leads to improved performance in the data-efficient setting (Section 5). To understand more fully where this gain in performance comes from, we further investigate the graph sparsity of different environments in relation to the performance of Graph Backup, in part using a novel method to visualise the full set of seen transitions and their graph structure (Section 6).

2. RELATED WORK

The idea of multi-step backup algorithms (e.g. TD(λ), n-step TD) dates back to early work in tabular reinforcement learning (Sutton, 1988; Sutton & Barto, 2018). Two approaches to multi-step targets are n-step methods and eligibility trace methods. The n-step method is a natural extension of the one-step target that takes the rewards and value estimates of n steps into the future into consideration. For example, the n-step SARSA (Rummery & Niranjan, 1994; Sutton & Barto, 2018) target for step $t$ is simply the discounted sum of the next $n$ rewards and the value at timestep $t + n$: $R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$. Graph Backup is an extension of an n-step backup target, Tree Backup, which will be described in Section 3. Eligibility trace (Sutton, 1988) methods instead estimate the λ-return, which is an infinite weighted sum of n-step returns. The advantage of the eligibility trace method is that it can be computed in an online manner without explicit storage of all the past experiences, while still computing accurate target value estimates. However, in the context of off-policy RL, eligibility traces are not widely applied because the use of a replay buffer means all past experiences are already stored. In addition, eligibility traces are designed for the case with a linear function approximator, and it is nontrivial to apply them to neural networks. van Hasselt et al. (2021) proposed an extension of the eligibility trace method called expected eligibility traces. Similar to Graph Backup, this allows information propagation across different episodes and thus enables counterfactual credit assignment. However, similar to the original eligibility trace methods, it is a better fit for the linear and on-policy case, whereas Graph Backup is designed for the non-linear and off-policy cases. Since a learned model can be treated as a distilled replay buffer (van Hasselt et al., 2019), we can view model-based reinforcement learning as related to our work.
Recent examples include Schrittwieser et al. (2020); Hessel et al. (2021); Farquhar et al. (2018); Hafner et al. (2021b); Kaiser et al. (2020b); Ha & Schmidhuber (2018). These MCTS-based algorithms also share some similarities with Graph Backup as they also utilise tree-structured search algorithms. However, our work is aimed at model-free RL, and so is separate from these works. Several recent works have also utilised the graph structure of MDP transition data. Zhu et al. (2020) propose to use the MDP graph as an associative memory to improve Episodic Reinforcement Learning (ERL) methods. This allows counterfactual reward propagation and can improve data efficiency. However, the usage of a data graph in this work is different from the usage in Graph Backup: the graph is used for control and as an auxiliary loss, rather than for target value estimation. Unlike our work, their associative memory graph also does not handle stochastic transitions, and the return for each trajectory is based only on the observed return (no bootstrapping is used). Topological Experience Replay (Hong et al., 2022, TER) uses the graph structure of the data in RL for better replay buffer sampling. TER uses the graph structure to decide which states should be sampled from the replay buffer during learning, by implementing a sampling mechanism that samples transitions closer to the goal first. This work is orthogonal (and possibly complementary) to ours, as TER is a replacement for uniform or prioritized sampling from a replay buffer while Graph Backup is a replacement for one-step or multi-step backup for value estimation.

3. PRELIMINARIES: ONE-STEP AND MULTI-STEP BACKUP

Given an MDP $M$, we denote by $\mathcal{A}$ the action space, by $\mathcal{S}$ the state space and by $\mathcal{R} \subset \mathbb{R}$ the reward space; $a_t \in \mathcal{A}$ and $s_t \in \mathcal{S}$ denote the specific action and state observed at step $t$. We denote a trajectory of states, actions and rewards as $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \dots)$. For a transition $(s_t, a_t, r_t, s_{t+1})$, the loss function of DQN methods is defined as the mean squared error between the predicted q-value and the backup target $G^{a_t}$ for $(s_t, a_t)$:

$$L(\theta \mid s_t, a_t) \overset{\text{def}}{=} \left(q_\theta(s_t, a_t) - G^{a_t}\right)^2,$$

where $q_\theta$ represents the online network parameterized by $\theta$. The backup target $G^{a_t}$ is an estimate of the optimal Q-value $q^*(s_t, a_t)$. Vanilla DQN uses a one-step bootstrapped backup, which makes gradient descent analogous to the update of tabular Q-learning:

$$G^{a_t}_{t:t+1} \overset{\text{def}}{=} r_{t+1} + \gamma \max_{a'} q_{\theta'}(s_{t+1}, a'),$$

where $\theta'$ are the parameters of the target network, which is standard in DQN. The one-step target makes the propagation of reward information to previous states slow, which is amplified by the use of a separate frozen target network. This motivates the use of more sample-efficient multi-step targets in DQN (Hessel et al., 2018; Hernandez-Garcia & Sutton, 2019). A widely used multi-step backup algorithm is n-step Q-Learning (n-step-Q) (Hessel et al., 2018; Silver et al., 2017). This method sums the rewards in the next $n$ steps, together with the maximum q-value at step $t+n$:

$$G^{a_t}_{t:t+n} \overset{\text{def}}{=} r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n} \max_{a'} q_{\theta'}(s_{t+n}, a').$$

n-step-Q exploits the chain structure of the trajectories with little computational cost but at the cost of biased target estimation. The distribution of the sum of the rewards $r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n}$ is conditioned on the behaviour policy $\mu$ which generates the data. This means that in an off-policy setting the estimated target value can be biased towards the value of the behaviour policy.
Another off-policy multi-step target is Tree Backup (Precup et al., 2000). Tree Backup is designed for general-purpose off-policy evaluation, meaning it aims to estimate the value of any target policy $\pi$ by observing the behaviour policy $\mu$. When the target policy is the optimal policy given by $q_{\theta'}$, Tree Backup recursively applies the one-step-Q backup to the trajectory, bootstrapping with the target value network when the input action $a$ is not the one taken in the trajectory ($a_t$):

$$G^{a}_{t:t+n} \overset{\text{def}}{=} \begin{cases} r_{t+1} + \gamma \max_{a'} G^{a'}_{t+1:t+n}, & \text{if } t < n,\ a = a_t \\ q_{\theta'}(s_t, a), & \text{otherwise.} \end{cases}$$

Despite what its name suggests, Tree Backup does not expand a tree of states and transitions, and so still only leverages the chain structure of trajectories. The name arises because the trajectory has leaves corresponding to the actions that were not selected in the current trajectory. In Figure 7 (c) we show the backup diagram of the 3-step Tree Backup, where yellow squares are these leaf actions.
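The three targets described in this section can be compared on a single stored trajectory. The following is a minimal sketch, not the implementation used in our experiments: the function names, the tuple layout `(state, action, reward, next_state)` and the callable `q_target` (which stands in for the target network and returns one value per action) are illustrative assumptions, and terminal states are ignored for brevity.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical layout: (state, action, reward, next_state).
Transition = Tuple[int, int, float, int]
# Stand-in for the target network: maps a state to one Q-value per action.
QFn = Callable[[int], Sequence[float]]


def one_step_target(tr: Transition, q_target: QFn, gamma: float) -> float:
    """One-step Q-learning target: reward plus discounted max target value."""
    _, _, r, s_next = tr
    return r + gamma * max(q_target(s_next))


def n_step_q_target(traj: List[Transition], q_target: QFn, gamma: float) -> float:
    """Sum discounted rewards along the chain, then bootstrap at the last state."""
    g = 0.0
    for k, (_, _, r, _) in enumerate(traj):
        g += gamma ** k * r
    return g + gamma ** len(traj) * max(q_target(traj[-1][3]))


def tree_backup_target(traj: List[Transition], q_target: QFn, gamma: float) -> float:
    """Greedy-policy Tree Backup along one trajectory.

    Actions not taken in the trajectory are valued by q_target; the action
    that was taken next is valued by the recursive target.
    """
    _, _, r, s_next = traj[0]
    q_next = list(q_target(s_next))
    if len(traj) > 1:
        a_next = traj[1][1]
        q_next[a_next] = tree_backup_target(traj[1:], q_target, gamma)
    return r + gamma * max(q_next)
```

On a two-step trajectory in which the untaken action at the intermediate state has a high target value, the Tree Backup and one-step targets pick up that value through the max, whereas the n-step-Q target only reflects the rewards that the behaviour policy actually collected.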

4. GRAPH BACKUP

In this section we introduce a new graph-structured backup operator, Graph Backup, extending the multi-step method Tree Backup. Graph Backup allows counterfactual reward propagation and variance reduction while also having the benefits of multi-step backup.

4.1. INTRODUCING GRAPH BACKUP

We propose the Graph Backup operator, which propagates temporal differences across the whole data graph rather than along a single trajectory. The differences between one-step, multi-step, Tree and Graph Backup are illustrated in Figure 7 (c). We want a backup method that can work with stochastic transitions, where a single state-action pair can lead to different states. In that case it is not obvious how to perform recursive backups to the next state, as there could be multiple next states. We estimate the transition probability to each next state using visitation counts, and use the estimated transition probabilities to compute the empirical mean over all possible state value estimates, weighted by the likelihood of transitioning to each state. This is easy to calculate efficiently and provides strong results, as seen in Section 5.

Denoting the set of all seen transitions by $\mathcal{T} \subseteq \mathcal{S} \times \mathcal{A} \times \mathcal{R} \times \mathcal{S}$, a counter function $f : \mathcal{T} \to \mathbb{N}^+$ maps each transition $T = (s, a, r, s')$ to its frequency $f(T)$. Notably, this counter function plays two roles: (1) it acts as the adjacency list of the graph, and (2) it weights transitions when the same action leads to different future states. The Graph Backup target for a state-action pair $(s, a)$ is then the average of the recursive one-step backups of all outgoing transitions. Similar to Tree Backup, if $(s, a)$ has not been seen, the target is estimated directly by the target network. Define $\mathcal{T}_{s,a} \overset{\text{def}}{=} \{(\hat{s}, \hat{a}, r, \hat{s}') \in \mathcal{T} \mid \hat{s} = s, \hat{a} = a\}$, the set of all $(\hat{s}, \hat{a}, r, \hat{s}')$ tuples starting with $s, a$. Extending Tree Backup, we can then define the Graph Backup (GB) value estimate as

$$G^{a}_{s} \overset{\text{def}}{=} \begin{cases} \dfrac{1}{c(s,a)} \sum_{T \in \mathcal{T}_{s,a}} f(T) \left[ r + \gamma \sum_{a'} \pi(a' \mid \hat{s}')\, G^{a'}_{\hat{s}'} \right] & \text{if } c(s,a) > 0 \\ q_{\theta'}(s,a) & \text{otherwise,} \end{cases} \qquad (5)$$

where $c(s,a) = \sum_{T \in \mathcal{T}_{s,a}} f(T)$ is the normaliser, $\pi$ is the target policy and $q_{\theta'}$ is the target network.
In the case where the target policy always chooses the action with the optimal Q-value, $\pi(a \mid s) = \mathbb{1}(\operatorname{argmax}_{a'} G^{a'}_{s} = a)$, the formula can be simplified into:

$$G^{a}_{s} \overset{\text{def}}{=} \begin{cases} \dfrac{1}{c(s,a)} \sum_{T \in \mathcal{T}_{s,a}} f(T) \left[ r + \gamma \max_{a'} G^{a'}_{\hat{s}'} \right] & \text{if } c(s,a) > 0 \\ q_{\theta'}(s,a) & \text{otherwise.} \end{cases} \qquad (6)$$

This is often the case since our implementations are based on DQN and we are interested in the optimal Q-value. In this paper Graph Backup refers to the simplified version in Equation (6). At a high level, Graph Backup performs dynamic programming Q evaluation on an empirical MDP, where all the Q-values for untried state-action pairs are initialized by a Q network. Our Graph Backup implementation extends Tree Backup. However, there could be other implementations which extend other multi-step methods, such as n-step-Q backup or the n-step version (Hernandez-Garcia & Sutton, 2019) of Retrace (Munos et al., 2016). In Appendix I, we present a variation of Graph Backup that extends n-step-Q backup. Note that in Equations (6), (7) and (8), the graph structure does not appear explicitly. This is because it is easier to formalise these backup operators mathematically using transition counts; from an implementation perspective, building and maintaining the data graph is the most efficient way of calculating these target value estimates. To provide better intuition for Graph Backup, in Appendix C we explicitly describe the data graph generated from an MDP and link it to Equation (6). The data graph contains the information for calculating and sampling from $\mathcal{T}$, $\mathcal{T}_s$, $\mathcal{T}_{s,a}$, $c(s, a)$, $c(s)$ and $f(T)$.
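As an illustration of Equation (6), the sketch below stores the counter function $f$ in a dictionary keyed by $(s, a)$ and evaluates the greedy Graph Backup target by recursion. It is a simplified sketch under stated assumptions rather than the implementation used in the paper: the class and function names are hypothetical, `q_target` stands in for the target network, and a `depth` cap (in the spirit of Section 4.3) is added so that the recursion terminates on graphs with cycles.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, Sequence, Tuple

State = Hashable


class TransitionCounter:
    """Counter function f over seen transitions, indexed by (s, a)."""

    def __init__(self) -> None:
        self.f: Dict[Tuple[State, int], Counter] = defaultdict(Counter)

    def add(self, s: State, a: int, r: float, s_next: State) -> None:
        # f((s, a, r, s_next)) += 1
        self.f[(s, a)][(r, s_next)] += 1

    def outgoing(self, s: State, a: int):
        # T_{s,a} with frequencies; empty if (s, a) was never tried.
        return self.f.get((s, a), {})


def graph_backup(s: State, a: int, counter: TransitionCounter,
                 q_target: Callable[[State], Sequence[float]],
                 n_actions: int, gamma: float, depth: int) -> float:
    """Greedy Graph Backup target for (s, a), following Equation (6)."""
    out = counter.outgoing(s, a)
    if depth == 0 or not out:
        # Unseen pair (c(s, a) = 0) or depth cap reached: use the target network.
        return q_target(s)[a]
    c = sum(out.values())  # normaliser c(s, a)
    total = 0.0
    for (r, s_next), freq in out.items():
        best = max(
            graph_backup(s_next, a2, counter, q_target, n_actions, gamma, depth - 1)
            for a2 in range(n_actions)
        )
        total += freq * (r + gamma * best)
    return total / c
```

For example, if $(s, a)$ was observed three times, twice leading to a state with bootstrapped value 1 and reward 0 and once leading to a state with value 0 and reward 1, then with $\gamma = 0.5$ the target is $\frac{1}{3}\left[2 \cdot 0.5 + 1 \cdot 1\right] = \frac{2}{3}$.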

4.2. ADVANTAGES OF GRAPH BACKUP

In Figure 2 we explain conceptually how Graph Backup benefits value estimation and thus the learning of the agent. We also present a more empirical analysis in Appendix M, Figure 7. Assuming the value estimates of all the states are initialised as 0, the one-step backup can update the value of only one state. The multi-step backup methods can further propagate the reward to the whole trajectory that leads to the goal. However, Graph Backup goes beyond that and propagates rewards to the states of another trajectory (the dashed line). This counterfactual reward propagation can significantly benefit credit assignment in sparse-reward tasks: during the exploration of a sparse-reward environment, policies usually generate a large number of trajectories that do not reach the goal, and while multi-step methods cannot efficiently leverage those transitions, Graph Backup can reuse them by propagating rewards from their crossovers with successful trajectories. The second row of Figure 2 shows another advantage of Graph Backup: reducing the variance of the value estimate. Multi-step backup in this case will assign different value estimates to the same state depending on the trajectory the state is sampled from (as it will appear multiple times in the replay buffer). This brings extra noise to the value estimate, which can be harmful to learning. In Figure 5, we show a simple case in MiniGrid where this target value noise constantly disturbs the convergence of DQN. Graph Backup removes this source of variance: because the value estimate is calculated from the underlying data graph, the same state always has the same value estimate regardless of which trajectory it is sampled from. In addition, in stochastic environments Graph Backup reduces variance by averaging over different next states from the same state-action pair.
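Both effects can be reproduced on a toy example. The sketch below is purely illustrative: the two trajectories, the state names and the all-zero `q_target` (a stand-in for a freshly initialised target network) are assumptions made for this example. The trajectories share the pair `("p", 0)` and cross at state `"x"`, but only one of them reaches the rewarding goal.

```python
from collections import Counter, defaultdict

GAMMA = 0.5
N_ACTIONS = 2

# Two hypothetical trajectories that cross at state "x".
traj_a = [("p", 0, 0.0, "x"), ("x", 0, 1.0, "goal")]  # reaches the goal
traj_b = [("p", 0, 0.0, "x"), ("x", 1, 0.0, "dead")]  # does not


def q_target(state, action):
    """Stand-in for the target network: all values initialised to zero."""
    return 0.0


def n_step_q(traj):
    """n-step-Q target of the first pair of traj, computed along traj only."""
    g = sum(GAMMA ** k * r for k, (_, _, r, _) in enumerate(traj))
    last = traj[-1][3]
    return g + GAMMA ** len(traj) * max(q_target(last, a) for a in range(N_ACTIONS))


# Transition counts f, shared by both trajectories.
counts = defaultdict(Counter)
for s, a, r, s2 in traj_a + traj_b:
    counts[(s, a)][(r, s2)] += 1


def graph_backup(s, a, depth):
    """Greedy Graph Backup target with a depth cap (see Equation (6))."""
    out = counts.get((s, a))
    if depth == 0 or not out:
        return q_target(s, a)
    c = sum(out.values())
    return sum(
        n * (r + GAMMA * max(graph_backup(s2, a2, depth - 1) for a2 in range(N_ACTIONS)))
        for (r, s2), n in out.items()
    ) / c


target_from_a = n_step_q(traj_a)            # 0.5: sampled from the successful trajectory
target_from_b = n_step_q(traj_b)            # 0.0: sampled from the unsuccessful one
graph_target = graph_backup("p", 0, depth=2)  # 0.5 regardless of the sampled trajectory
```

The 2-step target for `("p", 0)` depends on which trajectory the pair was sampled from, whereas the Graph Backup target is a function of the data graph alone and credits `("p", 0)` with the reward found by the other trajectory.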

4.3. LIMITING EXPANSION OF THE GRAPH

A naïve implementation of Graph Backup would follow the definition exactly and do an exhaustive recursive expansion of the graph. However, the computational cost of doing so can quickly blow up with the size of the replay data. Therefore, similar to the n-step backup methods, we need to limit recursive calls. For Graph Backup, this means expanding a smaller local graph from the source state, using the target network for value estimation when reaching the expansion limits. In our work, the expansion of the local data graph has both a breadth limit $b$ and a depth limit $d$. When the breadth limit is hit ($|\mathcal{T}_{s,a}| > b$), we sample $b$ transitions from $\mathcal{T}_{s,a}$ according to their frequency $f$, as opposed to expanding all transitions. If the depth limit is hit ($d < n$), the expansion of the graph is terminated (so the second case in Equations (6) and (7) is taken). The pseudocode for local graph expansion is shown in Algorithm 2. Figure 7 (c) also illustrates examples of limited expansion for Graph Backup, where faded nodes are clipped away due to hitting the limit. In our work, we make sure the expansion reaches $d$ steps in order to better align with multi-step methods. This ensures the algorithm reduces gracefully to Tree Backup when there are no crossovers between the trajectories. It also allows a more principled comparison between Tree Backup and Graph Backup. For $b = 1$, Graph Backup does a similar job to $d$-step Tree Backup, and increasing $b$ gradually makes Graph Backup leverage more structure from the transition graph.
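One possible way to realise the breadth and depth limits is a breadth-first expansion from the source state. The sketch below is a hypothetical illustration and should not be read as Algorithm 2 itself: the function name, the `counts` dictionary layout (keys `(s, a)`, values counting `(r, s_next)`) and the choice to sample with replacement via `random.choices` are all assumptions made here.

```python
import random
from collections import Counter


def expand_local_graph(counts, source, n_actions, depth_limit, breadth_limit, rng):
    """Return the expanded (state, action) pairs in breadth-first order.

    counts[(s, a)] is a Counter over (reward, next_state), i.e. the
    frequencies f restricted to T_{s,a}. rng is a random.Random instance.
    """
    expanded = []
    seen = {source}
    frontier = [source]
    for _ in range(depth_limit):
        next_frontier = []
        for s in frontier:
            for a in range(n_actions):
                out = counts.get((s, a))
                if not out:
                    continue  # untried action: a leaf valued by the target network
                expanded.append((s, a))
                transitions = list(out)
                if len(transitions) > breadth_limit:
                    # Breadth limit hit: sample b transitions by frequency
                    # (with replacement, an assumption of this sketch).
                    weights = [out[t] for t in transitions]
                    transitions = rng.choices(transitions, weights=weights, k=breadth_limit)
                for _, s_next in transitions:
                    if s_next not in seen:
                        seen.add(s_next)
                        next_frontier.append(s_next)
        frontier = next_frontier
    return expanded
```

A backup over the resulting subgraph would then process the returned list in reverse order, so that the values of states far from the source are computed before the values of states that depend on them.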

4.4. INTEGRATION OF OTHER RAINBOW COMPONENTS

To demonstrate that Graph Backup improves data efficiency in a realistic state-of-the-art algorithm, we integrate Graph Backup inside Rainbow (Hessel et al., 2018). As a replacement for n-step-Q backup, Graph Backup is orthogonal to all other ingredients. While some components such as prioritized experience replay (PER) (Schaul et al., 2016), noisy networks (Fortunato et al., 2018) and duelling network architectures (Wang et al., 2016) can be plugged in seamlessly, others require more care, which we describe here. Combining double DQN (van Hasselt et al., 2016) with Tree Backup and Graph Backup is quite straightforward. Double DQN uses the online network instead of the target network to specify the optimal policy in the Bellman update, so that $\max_a q_{\theta'}(s, a) = q_{\theta'}(s, \operatorname{argmax}_a q_{\theta'}(s, a))$ becomes $q_{\theta'}(s, \operatorname{argmax}_a q_{\theta}(s, a))$ in one-step or n-step-Q backup. For Tree Backup and Graph Backup, we can take the same approach for every expanded state. Distributional RL (Bellemare et al., 2017), specifically C51, attempts to model the whole distribution of the state-action value rather than its expectation, using a distributional version of the Bellman update (namely, one-step backup) when applied in the DQN setting. C51 divides the support of the value into discrete bins, called atoms, and the q network then outputs categorical probabilities over the atoms. In the distributional Bellman update, the vanilla Bellman update is applied to each atom, and the probability of the atom is distributed to the immediate neighbours of the target value. The loss is the KL divergence between the target and predicted value distributions rather than the mean squared error. In order to combine C51 with Tree Backup or Graph Backup, we apply the distributional Bellman update in every state node.

Algorithm 1 Double Distributional Graph Backup

Input: source state $S_{source}$, source action $A_{source}$, frequency mapping $f : \mathcal{T} \to \mathbb{N}^+$, list of states in the subgraph $L$, atoms $z_0, z_1, \dots, z_{N-1}$, online network $p(\cdot, \cdot \mid \theta)$ and target network $p(\cdot, \cdot \mid \theta')$
Set $S_{expanded}$ to be the set containing all the states in list $L$
Initialize the target values $\bar{G}^{a}_{s} = q_{\theta'}(s, a)$, $\forall s \in S_{expanded}, a \in \mathcal{A}$
for $(s, a)$ in $l_{max}, l_{max-1}, \dots, l_1$ do
    $a^* = \operatorname{argmax}_a \sum_i z_i\, p_i(s, a \mid \theta)$
    $m_i(s, a) = 0$, $i \in 0, 1, \dots, N-1$
    for $j \in 0, 1, \dots, N-1$ do
        for $t = (s, a, r, s') \in \mathcal{T}_{s,a}$ do
            $z'_j \leftarrow [r + \gamma z_j]^{V_{MAX}}_{V_{MIN}}$
            $b_j \leftarrow (z'_j - V_{MIN}) / \Delta z$
            $l \leftarrow \lfloor b_j \rfloor$, $u \leftarrow \lceil b_j \rceil$
            $m_l(s, a) \leftarrow m_l(s, a) + \frac{f(s,a,r,s')}{c(s,a)}\, p_j(s', a^*)(u - b_j)$
            $m_u(s, a) \leftarrow m_u(s, a) + \frac{f(s,a,r,s')}{c(s,a)}\, p_j(s', a^*)(b_j - l)$
        end for
    end for
end for
return $m_0(S_{source}, A_{source}), \dots, m_{N-1}(S_{source}, A_{source})$

In Algorithm 1, we combine double and distributional RL with Graph Backup given the subgraph state list calculated by Algorithm 2. Blue lines show the changes introduced by Graph Backup.
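The inner loops of Algorithm 1 amount to a categorical projection in which each outgoing transition contributes in proportion to its frequency. The following sketch shows this step for a single state-action pair. It is illustrative only: the function name and argument layout are assumptions, the next-state distributions are passed in directly rather than produced by a network, and, unlike the pseudocode, the case where a projected atom lands exactly on a support point ($l = u$) is handled explicitly so that no probability mass is lost.

```python
import math


def project_distribution(transitions, next_probs, atoms, gamma):
    """Frequency-weighted categorical projection for one (s, a).

    transitions: list of (reward, next_state, frequency) for T_{s,a}.
    next_probs[next_state]: probabilities over atoms for the greedy action
        at next_state (assumed to be given).
    atoms: evenly spaced support z_0 < ... < z_{N-1}.
    """
    n = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    dz = (v_max - v_min) / (n - 1)
    c = sum(freq for _, _, freq in transitions)  # normaliser c(s, a)
    m = [0.0] * n
    for r, s_next, freq in transitions:
        w = freq / c  # empirical transition probability
        for j, z in enumerate(atoms):
            tz = min(max(r + gamma * z, v_min), v_max)  # clipped Bellman update
            b = (tz - v_min) / dz
            l, u = math.floor(b), math.ceil(b)
            p = w * next_probs[s_next][j]
            if l == u:
                m[l] += p  # lands exactly on an atom
            else:
                m[l] += p * (u - b)
                m[u] += p * (b - l)
    return m
```

The returned vector sums to one whenever each next-state distribution does, and its mean equals the frequency-weighted expected target as long as no atom is clipped.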

4.5. ASSUMPTIONS

The effectiveness of Graph Backup relies on two assumptions about the environment: (1) the transition function of the environment is Markovian, and (2) there are crossovers between state trajectories. We show in Section 5 that, perhaps counter-intuitively, these assumptions hold frequently enough in high-dimensional environments (Atari100K from pixel input) for Graph Backup to differentiate itself from Tree Backup in a statistically significant manner. As such, these restrictions are not as strict as they may appear, and we further discuss how they can be relaxed in Section 7.

5. EXPERIMENTS

In order to test whether Graph Backup can bring benefits to the data efficiency of a DRL agent, we conduct experiments on singleton-MiniGrid, MinAtar and Atari100K. These tasks have increasingly sparse transition graphs, so that we can see how many crossovers are needed for Graph Backup to bring significant performance improvements. The baseline agent for MiniGrid and MinAtar is DQN (Mnih et al., 2015) and for Atari100K is Data-Efficient Rainbow (van Hasselt et al., 2019). The average training curves of the different backup methods are shown in Figure 3, where we run each algorithm for 5 random seeds. The performance metric for Atari in the plot is the mean and median of human-normalised scores (%). The final performance for each task and method can be found in Table 1, where we also include both mean and median metrics. The full results of each individual task are shown in Table 3 in the Appendix.

MiniGrid. We first compare the methods in 5 singleton MiniGrid tasks: Empty8x8, DoorKey6x6, KeyCorridorS3R1, SimpleCrossingS9N2 and LavaCrossingS9N2. Each of the 5 runs has a different random seed, which is kept fixed throughout training. We set the environment to be fully observable so that the transitions are Markovian. The overall number of possible states is small and the data graph is thus quite dense. The reward of MiniGrid is only given at the end of the episode, which makes credit assignment a critical problem. Among the 5 tasks, one-step backup and Tree Backup only managed to converge within 1e5 steps for the easiest empty room task. For other tasks with more complex navigation (SimpleCrossing and LavaCrossing) and interaction with objects (DoorKey and KeyCorridor), only Graph Backup converged in this low-data regime.

MinAtar 100K steps. We perform experiments on MinAtar. MinAtar is a collection of miniature Atari games with a symbolic representation of the objects.
The game state is fully specified by the observation of a 10 by 10 image, where each pixel corresponds to an object. We set the overall number of interactions to be 100,000, inspired by the Atari100K benchmark (Kaiser et al., 2020b). We can see in Figure 3 that Graph Backup outperforms Tree Backup, n-step-Q backup and one-step backup in terms of mean scores and interquartile mean (IQM) in the data-efficient setting.

Atari100K. In order to test whether Graph Backup can be applied to tasks with pixel observations, we test it in Atari100K. As suggested by its name, Atari100K limits the number of interactions of Atari games to 100,000, which is equivalent to 2 hours of game-play in Atari. Since the human performance scores reported by Mnih et al. (2015) are also from human experts after 2 hours of game-play, Atari100K is considered a test-bed for human-level data-efficient learning. We follow the standard frame processing protocol used by Rainbow (Hessel et al., 2018) without any other downsampling. The frame is processed into an 84 by 84 greyscale image and the observation is a stack of 4 previous frames, which leads to very sparse transition graphs. The baseline we chose for Atari100K is Data-Efficient Rainbow (van Hasselt et al., 2019), which is a variation of Rainbow that is optimized particularly for Atari100K. Consistent with the results in MiniGrid and MinAtar, Graph Backup performs better than one-step and multi-step methods. The results here show that (1) Graph Backup is robust, in that its improvements are orthogonal to other DQN improvements, and (2) Graph Backup works for high-dimensional, pixel-based tasks that have sparse transition graphs.
MinAtar 1M steps. In order to see the asymptotic performance of Graph Backup, in that duplicated states happen independently, this means that in 53% of the backup updates, there will be a crossover in the next 10 steps, which means Graph Backup will give a different value estimate from multi-step methods more than half the time. However, the crossovers are not distributed uniformly on the transition graph. In order to investigate the topology of the graph, we visualize the exact graph structure of the whole transition graph. In Figure 4, we show four representative transition graphs and leave the others to Figure 8 in the Appendix. We apply radial layouts as proposed by Wills (1999), which scale well with the number of nodes and align well with the transition structure of most of the games, where the central point corresponds to the start of the game. Many transition graphs of Atari100K games show interesting crossover structures that can be leveraged by Graph Backup. For example, the transition graph of Freeway forms a windmill-like pattern, where each blade corresponds to a group of trajectories that have connections within the group but not between groups. There are also some tasks (e.g. Alien) where crossovers mostly happen in the starting stage of the game.

Explaining Crossovers. By comparing the transition graphs and the pixel observations, we can provide two explanations for the existence of the trajectory crossovers in Atari games. The first factor is that there is a low number of degrees of freedom for the objects, and especially for the agent avatar, in many of these games. For most of the Atari games, the agent can only move on a 2D plane, which partially alleviates the curse of dimensionality since we only need two dimensions to specify the position of the agent. A second important factor is that Atari games always have a fixed initial state.
Although we follow prior work (Hessel et al., 2018) to add a random number of no-ops after the start of the game, the initial observations the agent sees will still be quite similar, and this creates crossovers at the beginning of the episode (the centre of the plot, for example for Alien in Figure 4 ). In light of these characteristics, we recommend practitioners apply Graph Backup with exact state matching to tasks that either (1) have few degrees of freedom, (2) have a discrete state space, or (3) are highly repetitive with minimal noise. There are many real-world tasks that have similar properties, such as any 2D navigation (e.g. household cleaning robots), power management and manufacturing in assembly lines.

7. LIMITATIONS AND FUTURE EXTENSIONS

The high-level insight of Graph Backup is to treat all the transition data as a collective entity rather than as independent trajectories, and to exploit its (graphical) structure for the sake of efficient learning. This work shows a simple implementation of this idea, by building the graph with exact state matching. While there is already a large class of tasks that have crossovers, we expect in future to extend Graph Backup to cover even more challenging tasks. For example, with a learned or human-crafted discrete representation of the true state (van den Oord et al., 2017; Hu et al., 2017; Hafner et al., 2021a; Kaiser et al., 2020a) or with a similarity kernel measuring the distance between states, Graph Backup might be able to tackle continuous control or tasks with partial observability.

8. REPRODUCIBILITY STATEMENT

The code to reproduce all the experimental results is available in the supplementary materials. The README includes guides for setting up the environment and the commands (with hyperparameters) to run the experiments. Experiments for all environments can be run on CPUs, but a single GPU can speed up the Atari tasks. Besides the code, the pseudocode in Algorithm 1 and Algorithm 2 and the hyperparameters described in Appendix D can also help readers who wish to implement the method themselves.

Algorithm 2 Local Graph Expansion

Input: source state $S_{source}$, source action $A_{source}$, depth limit $d$, breadth limit $b$, frequency mapping $f : \mathcal{T} \to \mathbb{N}^+$
Initialize the set containing states on the boundary of expansion $S_{new} \leftarrow \{S_{source}\}$
Initialize the list of expanded state-action pairs $L$, denoting the last element in the list to be $l_{max}$
for $i = 0$ to $d$ do

A OTHER EMPIRICAL FINDINGS

In general, the experiments in three different settings show that Graph Backup consistently brings improvements over multi-step methods. Besides that, we also find that the improvements of n-step-Q backup over one-step backup are actually quite limited in the data-efficient setting, whereas Tree Backup performs significantly better than n-step-Q backup. This can be explained by the off-policy nature of Tree Backup, as it can bring the benefits of multi-step reward propagation without biasing the value estimation. In the data-efficient setting, the flaw of n-step-Q is amplified as the learning relies more on historical rather than freshly sampled data. Interestingly, Tree Backup has not received a lot of attention in the DRL community. Hernandez-Garcia & Sutton (2019) tested Tree Backup in a toy mountain car experiment, which showed that n-step-Q performs best among multiple multi-step methods, including Tree Backup. Touati et al. (2018) point out the instability of Tree Backup when combined with function approximation, both with theoretical analysis and with empirical evaluation on some constructed counter-example MDPs. However, our experiments on a larger-scale and more diverse set of tasks show that Tree Backup has superior sample efficiency when combined with a modern DRL method.

B STABLE VALUE ESTIMATE

Graph Backup can integrate information from a subgraph, yielding a more accurate and stable value estimate. On the other hand, the nested max operators might lead to overestimation of the value. Here we analyse the value estimates given by different backup operators. We collect 5000 transitions with random walks in a MiniGrid 5x5 Empty Room environment, and have the agents learn $q^*$ from this static dataset. In Figure 5, we show the mean and standard deviation of the latest 10 estimates of the same state-action pairs. Both the mean and the standard deviation are averaged over different state-action pairs. The value estimate of Graph Backup stabilizes quickly, after a few hundred optimization iterations, as the mean value converges and the standard deviation reduces to near 0. In contrast, all other backup methods keep giving varied estimates for the exact same states, leading to a higher standard deviation and a fluctuating mean. In terms of overestimation, Graph Backup does give a slightly higher estimate at first (the little bump in the curve), but it quickly recovers to a stable value.

C DATA GRAPH DEFINITION

An MDP data graph is a bipartite multigraph $(\mathcal{S}_{seen}, \mathcal{N}, E_{out}, E_{in})$, where $\mathcal{S}_{seen} \subseteq \mathcal{S}$ is the set of seen state nodes, $\mathcal{N} \subseteq \mathcal{S}_{seen} \times \mathcal{A}$ is the set of (state-conditioned) action nodes, $E_{out}$ is a multiset of edges pointing from state nodes to action nodes and $E_{in}$ is a multiset of reward-weighted edges pointing from action nodes to state nodes. A state node can point to multiple action nodes because multiple actions might be tried, and an action node can point to multiple state nodes because of the stochastic dynamics. When a new transition $(s, a, r, s')$ is observed, the edge $(s, (s, a))$ is added to $E_{out}$ and $((s, a), r, s')$ is added to $E_{in}$. A visual example of this data graph can be seen in the Graph Backup diagram in Figure 7 (c) with tried (blue) action nodes only. Relating this data graph to Equation (6), we can see that $c(s, a)$ is the number of $(s, (s, a))$ edges in $E_{out}$ and $f((s, a, r, s'))$ is the number of $((s, a), r, s')$ edges in $E_{in}$.
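The bookkeeping implied by this definition can be sketched with two multisets, one per edge type. The class below is an illustrative assumption, not the data structure used in our code; the attribute names `state_to_action` and `action_to_state` are chosen to describe the edge direction directly.

```python
from collections import Counter


class DataGraph:
    """Bipartite multigraph over state nodes and (state, action) nodes."""

    def __init__(self):
        # Multiset of edges (s, (s, a)) from state nodes to action nodes.
        self.state_to_action = Counter()
        # Multiset of reward-weighted edges ((s, a), r, s_next)
        # from action nodes to state nodes.
        self.action_to_state = Counter()

    def add_transition(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s_next)."""
        self.state_to_action[(s, (s, a))] += 1
        self.action_to_state[((s, a), r, s_next)] += 1

    def c(self, s, a):
        """Normaliser c(s, a): how many times action a was tried in state s."""
        return self.state_to_action[(s, (s, a))]

    def f(self, s, a, r, s_next):
        """Frequency f of the transition (s, a, r, s_next)."""
        return self.action_to_state[((s, a), r, s_next)]

    def seen_states(self):
        """All state nodes that appear at either end of a transition."""
        states = {s for (s, _) in self.state_to_action}
        states.update(s_next for (_, _, s_next) in self.action_to_state)
        return states
```

Because `Counter` returns 0 for missing keys, `c(s, a) == 0` identifies untried state-action pairs, which are the ones valued by the target network in Equation (6).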

D DETAILS ABOUT EXPERIMENT SETUP

Graph Backup and multi-step backup both use a depth limit of 5 for MiniGrid and MinAtar, and 10 for Atari100K. The breadth limit for GB-limited is 50 for MiniGrid, 20 for MinAtar and 10 for Atari100K. For MiniGrid and MinAtar, all backup methods are based on vanilla DQN. The q network has 2 convolutional layers and 2 dense layers, and we follow the hyper-parameters of Rainbow (Hessel et al., 2018) with a target network update frequency of 8000 and ϵ-greedy exploration with ϵ = 0.02. The learning rate is 0.001 for MiniGrid and 0.000065 for MinAtar. The discount factor is 0.95 for MiniGrid and 0.99 for MinAtar. The replay frequency is 1 for MiniGrid and 4 for MinAtar. Since we tested the algorithm in a data-efficient setting, the size of the replay buffer is set to be equal to the overall number of training steps. As for Atari100K, our baselines and Graph Backup agents are based on Data-Efficient Rainbow (van Hasselt et al., 2019) with the same hyper-parameters as Schwarzer et al. (2021).

E GRAPH SPARSITY

Across different tasks, we can see a correlation between the density of the transition graph and the improvement of Graph Backup. For MiniGrid tasks, where the possible number of states is limited, Graph Backup brings much larger improvements, whereas for MinAtar and Atari the graph is sparse, as there are multiple other objects besides the agent that can move in the game world. To analyse the graph density quantitatively, we propose the metric of the novel state ratio. The novel state ratio is the ratio between the number of non-duplicated states and the number of all states that the agent has seen. The novel state ratio will be 1 if there are no overlapping states in the transition graph, in which case Graph Backup reduces to Tree Backup. The average novel state ratios of MiniGrid/MinAtar/Atari are 0.006/0.298/0.927 respectively. The relative average improvements of Graph Backup compared to Tree Backup are 190%/89%/17% on these three groups of tasks. The graph density alone, however, is not a reliable indicator for (linearly) predicting how much improvement Graph Backup can bring to a specific task. Although we know Graph Backup will be the same as Tree Backup if the graph has no crossovers, more crossovers do not always guarantee a larger performance improvement. When we investigate the normalised performance gain and the novel state ratio for each individual task we tested, the correlation is -0.29. Other factors, such as the structure of the graph, the reward density or simply the performance upper bound, can also affect the performance gain. As mentioned in Section 4.2, Graph Backup seems to bring more benefits in sparse-reward tasks, which can be explained by its counterfactual reward propagation. The structural pattern of the graph, given the same number of crossovers, can also play a role.
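The novel state ratio, and the rough crossover estimate it gives under an independence assumption, can be computed in a few lines. The sketch below makes two assumptions that should be read as such: "non-duplicated states" is interpreted as the number of distinct states divided by the total number of visited states, and each of the next `depth` states is treated as novel independently with probability equal to the ratio. The function names are illustrative.

```python
def novel_state_ratio(visited_states):
    """Distinct states divided by all visited states (1.0 means no overlap)."""
    return len(set(visited_states)) / len(visited_states)


def crossover_probability(ratio, depth):
    """Probability of at least one duplicated state within `depth` steps,
    assuming each step is novel independently with probability `ratio`."""
    return 1.0 - ratio ** depth


# Atari100K average reported above, with the depth limit of 10 used for Atari100K.
atari_crossover = crossover_probability(0.927, 10)
```

With the Atari100K average of 0.927 and a depth of 10, this gives $1 - 0.927^{10} \approx 0.53$, which matches the figure of 53% of backup updates quoted in the main text.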

F GRAPH STRUCTURE VISUALISATION

In order to investigate the topology of the graph, we visualise the exact structure of the whole transition graph. In Figure 4, we show four representative transition graphs and leave the others to Figure 8 in the Appendix. We apply radial layouts as proposed by Wills (1999), which scale well with the number of nodes and align well with the transition structure of most of the games. Since the common protocol for evaluating DRL agents in Atari games involves a random number of no-ops before the agent takes over the game, the initial states can vary between episodes. To adjust for this, we create a hypothetical meta-initial state pointing to all initial states of a game. The meta-initial state is then treated as the root of the whole graph and displayed in the centre of each plot. Many transition graphs of Atari100K show interesting crossover structures that can be leveraged by Graph Backup. For example, the transition graph of Freeway forms a windmill-like pattern, where each blade corresponds to a group of trajectories that have connections within the group but not between groups. There are also some tasks (e.g. Alien) where crossovers only happen at the start of the game. In such a case, Graph Backup will not be helpful for most of the source states. We also find that some MinAtar and Atari games have self-loop states (e.g. Asterix), represented as circles in the graph. After further investigation, we found that self-loops exist because some of the state transitions in MinAtar and Atari do not change the observation (such as periodically moving objects). This violates the underlying assumption of Graph Backup that the transitions must be Markovian, which can explain why Graph Backup is inferior to Tree Backup in some of the tasks. On the other hand, the fact that Graph Backup still outperforms multi-step methods on average suggests that Graph Backup is robust against minor violations of the Markovian assumption.
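The graph construction described above, including the meta-initial state and the detection of self-loops, can be sketched as follows. This is an illustrative sketch rather than the plotting code used for the figures: the names `META_ROOT`, `build_transition_graph` and `self_loop_states` are hypothetical, and episodes are assumed to be lists of hashable states.

```python
from collections import defaultdict

META_ROOT = "meta-initial"  # hypothetical label for the artificial root node


def build_transition_graph(episodes):
    """Return {state: set of successor states}, rooted at META_ROOT."""
    graph = defaultdict(set)
    for episode in episodes:
        if not episode:
            continue
        # The meta-initial state points to every observed initial state.
        graph[META_ROOT].add(episode[0])
        for state, next_state in zip(episode, episode[1:]):
            graph[state].add(next_state)
        # Make sure the last state of the episode also appears as a node.
        graph.setdefault(episode[-1], set())
    return dict(graph)


def self_loop_states(graph):
    """States with a transition back to themselves (unchanged observation)."""
    return {state for state, successors in graph.items() if state in successors}


episodes = [["a", "b", "b", "c"], ["d", "b", "c"]]
graph = build_transition_graph(episodes)
print(sorted(graph[META_ROOT]), self_loop_states(graph))
```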

G FULL EXPERIMENT RESULTS

In Table 3 , we show the results of each individual task and the mean/median of the average performance in each group of tasks.

H ALL TRANSITION GRAPHS

In Figure 8, we visualise all the transition graphs of Atari100K.

I MIXED GRAPH BACKUP

By extending the n-step-Q backup with the graph structure, we obtain another backup target, named mixed Graph Backup (GB-mixed). GB-mixed only applies the max operator on the boundary nodes of the transition graph. Define $\mathcal{T}_s \overset{\text{def}}{=} \{(\hat{s}, \hat{a}, r, \hat{s}') \in \mathcal{T} \mid \hat{s} = s\}$ and $c(s) = \sum_{T \in \mathcal{T}_s} f(T)$ similarly to before. The GB-mixed target for the state value is then:

$$\bar{G}_s = \begin{cases} \dfrac{1}{c(s)} \sum_{T \in \mathcal{T}_s} f(T)\,\left(r + \gamma G_{\hat{s}'}\right) & \text{if } c(s) > 0 \\ \max_a q_{\theta'}(s, a) & \text{otherwise} \end{cases} \tag{7}$$

The GB-mixed target for the state-action value is then a frequency-weighted average of the next state target:

$$\bar{G}^a_s = \begin{cases} \dfrac{1}{c(s, a)} \sum_{T \in \mathcal{T}_{s,a}} f(T)\,\left(r + \gamma G_{\hat{s}'}\right) & \text{if } c(s, a) > 0 \\ q_{\theta'}(s, a) & \text{otherwise} \end{cases} \tag{8}$$

Similar to the n-step-Q backup, GB-mixed is not a strictly off-policy backup operator. The value of boundary states is estimated in an off-policy manner while the rewards of interior paths are on-policy, hence the name mixed Graph Backup. GB-mixed is a biased backup operator when evaluating the target policy. However, it can also be less noisy than GB-nested, since in GB-nested the nested max operators can propagate over-optimistic value estimation errors from every step to the source state.
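The recursion behind the two GB-mixed targets can be sketched in a few lines. This is a sketch under stated assumptions, not our implementation: the next-state target inside the sum is taken to be the mixed state target itself, a depth limit is added so that loops in the graph terminate, terminal states are not treated specially, `counts` is a hypothetical dictionary mapping each transition `(s, a, r, s_next)` to its frequency f, and `q_target` is a hypothetical stand-in for the target network that returns a list of action values.

```python
def gb_mixed_state_target(s, counts, q_target, gamma, depth):
    """Mixed target for a state: frequency-weighted average over the
    observed outgoing transitions, with a max only at boundary states."""
    outgoing = [(t, f) for t, f in counts.items() if t[0] == s]
    c = sum(f for _, f in outgoing)
    if c == 0 or depth == 0:
        # Boundary node: fall back to the target network with a max.
        return max(q_target(s))
    total = 0.0
    for (_, _, r, s_next), f in outgoing:
        nxt = gb_mixed_state_target(s_next, counts, q_target, gamma, depth - 1)
        total += f * (r + gamma * nxt)
    return total / c


def gb_mixed_action_target(s, a, counts, q_target, gamma, depth):
    """Mixed target for a state-action pair, restricted to transitions
    that start in s and take action a."""
    outgoing = [(t, f) for t, f in counts.items() if t[0] == s and t[1] == a]
    c = sum(f for _, f in outgoing)
    if c == 0 or depth == 0:
        return q_target(s)[a]
    total = 0.0
    for (_, _, r, s_next), f in outgoing:
        nxt = gb_mixed_state_target(s_next, counts, q_target, gamma, depth - 1)
        total += f * (r + gamma * nxt)
    return total / c


# Toy graph (illustrative values only).
counts = {("s0", 0, 0.0, "s1"): 3, ("s0", 1, 1.0, "s2"): 1, ("s1", 0, 1.0, "s2"): 2}
q_values = {"s0": [0.1, 0.3], "s1": [0.0, 0.0], "s2": [0.5, 2.0]}
print(gb_mixed_state_target("s0", counts, q_values.get, gamma=0.5, depth=5))
```

In the toy graph only `s2` has no outgoing transitions, so it is the only state where the max over the target network output is applied.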

J GRAPH DENSITY OF ATARI100K

We also explore the role of graph density within the same set of tasks. In Figure 6, we show the correlation between relative performance and novel state ratio, where each point represents a task in Atari100K. The relative performance is the human-normalised score of GB-nested divided by that of Tree Backup. It indicates how much benefit the agent can get from leveraging the graph structure. There is a negative correlation of -0.22 between the novel state ratio and the relative performance. Although the correlation is weaker than in the cross-group comparison, the graph density can still affect the effectiveness of Graph Backup.

Figure 6: Relative performance and novel state ratio
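The quantities in this analysis are simple to compute. The sketch below shows one way to obtain the relative performance and a Pearson correlation coefficient; we assume here that the reported correlation is of the Pearson type, and the function names are hypothetical. No task data is included, so real scores and ratios have to be supplied by the reader.

```python
import math


def relative_performance(gb_score, tree_score):
    """Human-normalised score of GB-nested divided by that of Tree Backup."""
    return gb_score / tree_score


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if len(ys) != n or n in (0, 1):
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(novel_state_ratios, relative_performances)` with one entry per task gives the kind of coefficient discussed above.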

K MINATAR 1M

Table 2 shows the performance of different methods after 1M steps of training. Both one-step backup and Graph Backup achieve mean scores close to the DQN asymptotic performance reported by Young & Tian (2019). This shows that, even with more training data, Graph Backup is able to converge to the same level of performance as one-step backup. Surprisingly, though, the n-step-Q backup and Tree Backup both perform worse than one-step backup with more data. This can be explained by the innate problems of n-step-Q and nested max operators. Strictly speaking, n-step-Q is not an off-policy backup operator because it always uses online reward sequences to estimate its value, which can counteract its advantage of longer reward propagation. This can be especially true for MinAtar, because its frame rate is much lower than that of vanilla Atari and longer reward propagation may not be as important. For Tree Backup and Graph Backup, the nested max operator may cause an overestimation of values. We leave methods that address this problem for future work. With the overestimation dealt with properly, Graph Backup may show even stronger performance in an asymptotic setting.

L COMPUTATIONAL OVERHEAD

In contrast to model-based planning methods, Graph Backup does not add computational overhead at action selection. Therefore, during deployment, the decision latency and the computational cost are the same as for previous methods. We believe this is important in a data-efficient setting, where additional computation at training time is needed to make up for the limited number of samples from the environment. On the other hand, the computational overhead of Graph Backup during training is nuanced and depends heavily on the implementation, the base algorithm, the hyper-parameters and the network architecture. In our implementation, the subgraph is constructed from an adjacency list (a Python dictionary), and the values of expanded state nodes can be computed in a batched way. This is quite different from model-based planning methods like MCTS, where the tree expansion is conditioned on the policy and value estimates of parent nodes. Because neural network inference is easy to parallelise, the main bottleneck becomes the subgraph expansion itself. Our implementation performs the graph expansion and backup in Python code and processes the different samples in a batch sequentially, which makes training 2-5 times slower than with a one-step backup. However, this is only a simple proof-of-concept implementation, and it could likely be sped up substantially by JIT compilation or multi-threading (or multi-processing).
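To make the data structure concrete, the following sketch shows an adjacency list stored in a Python dictionary, together with a single batched evaluation of all expanded states. It is a simplified illustration and not our actual code: `build_adjacency` and `batched_state_values` are hypothetical names, and `q_batch` is a hypothetical stand-in for one batched forward pass of the target network.

```python
from collections import defaultdict


def build_adjacency(transitions):
    """Map state -> {(action, reward, next_state): visit count}."""
    adjacency = defaultdict(lambda: defaultdict(int))
    for s, a, r, s_next in transitions:
        adjacency[s][(a, r, s_next)] += 1
    return adjacency


def batched_state_values(states, q_batch):
    """Evaluate every distinct expanded state with one batched call."""
    unique_states = list(dict.fromkeys(states))  # keep order, drop repeats
    values = q_batch(unique_states)
    return dict(zip(unique_states, values))


transitions = [("a", 0, 0.0, "b"), ("a", 0, 0.0, "b"), ("a", 1, 1.0, "c")]
adjacency = build_adjacency(transitions)
print(dict(adjacency["a"]))
```

Because the expansion does not depend on network outputs, all states collected from the subgraph can be passed to the network together, which is the property contrasted with MCTS above.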

M VALUE ESTIMATION VISUALIZATIONS

In Figure 7, we visualise the value estimates for all positions in an empty room environment. An agent starts at the bottom left corner and tries to reach the goal position at the top right. We fix the value of the goal position to be 1, and all other value estimates come from the Q-network when the agent is at that position and facing right. The discount factor γ = 0.95 is the same as in our empirical evaluations, so the ground truth value for the initial position is 0.57. We fix the random seeds of the agents, so all of them find the first reward at step 1957, and training starts at step 2000. We can see that Graph Backup quickly converges to the ground truth value estimates for half of the states at step 3000, whereas the other backup methods still fail to correctly estimate the value of the initial state even after 30000 steps of training. An interesting observation is that the value is not overestimated by Graph Backup even though it has many nested max operators. On the other hand, the values of some states are overestimated by the Tree Backup and one-step backup agents at step 30000. This can be explained by the combination of counterfactual credit assignment and online interaction. If the value of a state is overestimated, the excess value quickly propagates to many of the interconnected states. The agent then actively tries to reach the overestimated states, and the overestimation is corrected. Another interpretation is that the overestimation comes from the extra noise of one-step backup and Tree Backup. In Figure 5, even when training the value from logged data, Tree Backup and one-step backup sometimes give higher value estimates than Graph Backup.
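The ground truth value of 0.57 can be checked with a short calculation. The sketch below assumes that the value of a position is the goal value of 1 discounted once per step on the shortest path; under that assumption, 0.57 corresponds to 11 discounted steps from the initial position. The step count is inferred from the stated numbers and is not a quantity reported in the text.

```python
import math


def implied_steps(value, gamma):
    """Number of discount steps n such that gamma ** n is closest to value."""
    return round(math.log(value) / math.log(gamma))


def discounted_value(gamma, steps, goal_value=1.0):
    """Value of a state that is `steps` discounted steps from the goal."""
    return goal_value * gamma ** steps


n = implied_steps(0.57, 0.95)
print(n, round(discounted_value(0.95, n), 2))
```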



Or sometimes the Huber Loss (Huber, 1992).

In this case, both Tree Backup and n-step-Q backup can produce the estimation shown in the example tasks.

In Figure 2, the noise comes from different rewards at the end of the trajectories.

In fact, if there are loops in the graph, the situation can be even worse, as the algorithm may never converge.

Since the agent interacts with the environment every 4 frames, such preprocessing still assumes the transition to be Markovian.

The scores of MiniGrid are treated as normalised as they are. The scores of MinAtar are normalised by the 5-million-step DQN performance reported by Young & Tian (2019).



Figure 1: (a) shows the transition graph of Frostbite, an Atari game, extracted from the replay buffer of a Graph Backup agent after 100k steps. (b) shows backup diagrams for different backup targets. The circles are states, the blue squares represent the actions that have been observed for the given state node, and the orange squares are actions where target network evaluation happened.

Figure 2: Benefits of Graph Backup

Figure 3: Summary of training curves for MiniGrid, MinAtar and Atari100K. For Atari100K, we show both the mean and IQM of the human-normalised scores.

Figure 4: Transition graphs of selected Atari100K games, with data collected by a one-step DQN agent. As there are too many state nodes, we do not draw the nodes directly and only draw the edges (transitions); circles represent self-loop transitions.

on boundary T_new ← {t | ∀t = (s, a, r, s′) ∈ T, s ∈ S_new}
5: Sample b transitions from T_new with p(t) ∝ f(t), getting T_pruned = {t_1, t_2, ..., t_b}
6: Append state-action pairs to list L, {l_max+1, l_max+2, ...} = {(s, a) | ∀(s, a, r, s′) ∈ T_pruned}
7: Update boundary states S_new = {s′ | ∀(s, a, r, s′) ∈ T_pruned}
8: end for
9: return L
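The visible steps of the listing above can be sketched in Python as follows. Since the first steps of the listing are not shown here, several parts are assumptions and are marked as such: the boundary is initialised with the source state, the loop runs once per depth level, sampling is done without replacement, and all boundary transitions are kept when there are at most b of them. The names `expand_subgraph` and `weighted_sample` and the `counts` dictionary (transition to frequency f) are hypothetical.

```python
import random


def weighted_sample(items, weights, k, rng):
    """Sample up to k items without replacement, with p proportional to weight.
    Sampling without replacement is an assumption of this sketch."""
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(min(k, len(items))):
        i = rng.choices(range(len(items)), weights=weights, k=1)[0]
        chosen.append(items.pop(i))
        weights.pop(i)
    return chosen


def expand_subgraph(source, counts, depth, breadth, rng):
    """Collect state-action pairs by breadth-limited expansion.
    counts maps (s, a, r, s_next) to the transition frequency f."""
    pairs = []           # the list L
    boundary = {source}  # assumed initial value of S_new
    for _ in range(depth):  # assumed loop over depth levels
        # Transitions whose source state lies on the boundary (T_new).
        t_new = [t for t in counts if t[0] in boundary]
        if not t_new:
            break
        # Step 5: sample b transitions with p(t) proportional to f(t).
        pruned = weighted_sample(t_new, [counts[t] for t in t_new], breadth, rng)
        # Step 6: append the state-action pairs to L.
        pairs.extend((s, a) for s, a, _, _ in pruned)
        # Step 7: the next states become the new boundary.
        boundary = {s_next for _, _, _, s_next in pruned}
    return pairs


counts = {("a", 0, 0.0, "b"): 2, ("a", 1, 0.0, "c"): 1,
          ("b", 0, 1.0, "d"): 1, ("c", 0, 0.0, "d"): 1}
print(expand_subgraph("a", counts, depth=2, breadth=10, rng=random.Random(0)))
```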

Figure 5: Mean and standard deviation of the value estimate for the same state-action pairs for a fixed dataset collected in MiniGrid Empty Room. The x-axis is the number of optimization steps.

Figure 7: Value Map of Empty Room

Figure 8: All Transition Graphs of Atari100K

Numeric summary of the performance. IQM stands for interquartile mean (Agarwal et al., 2021).

MinAtar results after 1M steps of training. Values for each individual task are mean ± standard error.

In Table 2, we also show the results for 1 million steps of training with hyper-parameters still kept in the data-efficient setting. Interestingly, while Graph Backup still preserves a strong advantage over n-step-Q and Tree Backup in terms of aggregated metrics, one-step backup shows comparable performance to Graph Backup with more training. Looking at individual tasks, we find that Tree Backup is inferior to one-step backup on Breakout, Asterix and Freeway. From a practical perspective, the MinAtar 1M results show that Graph Backup achieves consistent performance in multiple scenarios, while the baselines (one-step backup and Tree Backup) only show their advantage in settings with specific data availability.

Here we study why Graph Backup can bring benefits to Atari games. Atari games have pixel-based observations, which have 255^(210 × 180 × 3) combinations if each pixel can take any value independently of the other pixels. The first question is then: are there crossovers in just 100K transitions? To analyse the graph density quantitatively, we propose to measure the novel state ratio. The novel state ratio is the ratio between the number of non-duplicated (i.e. unique) states and the number of all states that the agent has seen. The novel state ratio is 1 if there are no overlapping states in the transition graph, in which case Graph Backup reduces to Tree Backup. The average novel state ratio of Atari is 0.927, which means the graphs are usually quite sparse (the average novel state ratios of MiniGrid/MinAtar are 0.006/0.298 respectively). However, if we assume

Full results of all the tasks. The agent for MiniGrid and MinAtar is based on DQN, and the agent for Atari100K is based on Data-Efficient Rainbow. The default backup operator for Rainbow is n-step-Q. The values in the table for MiniGrid and MinAtar are raw game scores and those for Atari100K are human-normalised scores.

The unnormalised scores for Atari100K. Note that taking the average of these scores leads to an evaluation metric that is heavily weighted towards games with a higher score range.

