VALUE MEMORY GRAPH: A GRAPH-STRUCTURED WORLD MODEL FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement Learning (RL) methods are typically applied directly in environments to learn policies. In complex environments with continuous state-action spaces, sparse rewards, and/or long temporal horizons, learning a good policy in the original environment can be difficult. Focusing on the offline RL setting, we aim to build a simple and discrete world model that abstracts the original environment. RL methods are then applied to our world model instead of the environment data for simplified policy learning. Our world model, dubbed Value Memory Graph (VMG), is designed as a directed-graph-based Markov decision process (MDP) whose vertices and directed edges represent graph states and graph actions, respectively. As the state-action spaces of VMG are finite and relatively small compared to the original environment, we can directly apply the value iteration algorithm on VMG to estimate graph state values and identify the best graph actions. VMG is trained from and built on the offline RL dataset. Together with an action translator that converts the abstract graph actions in VMG to real actions in the original environment, VMG controls agents to maximize episode returns. Our experiments on the D4RL benchmark show that VMG can outperform state-of-the-art offline RL methods in several goal-oriented tasks, especially when environments have sparse rewards and long temporal horizons.

1. INTRODUCTION

Humans are usually good at simplifying difficult problems into easier ones by ignoring trivial details and focusing on the information that matters for decision making. Typically, reinforcement learning (RL) methods are applied directly in the original environment to learn a policy. In a difficult environment such as robotics or video games, with long temporal horizons, sparse reward signals, or large and continuous state-action spaces, it becomes challenging for RL methods to reason about the value of states or actions well enough to obtain a well-performing policy. Learning a world model that simplifies the original complex environment into an easier version may lower the difficulty of policy learning and lead to better performance. In offline reinforcement learning, algorithms can access a dataset of pre-collected episodes to learn a policy without interacting with the environment. Usually, the offline dataset is used as a replay buffer to train a policy in an off-policy way with additional constraints to avoid distribution shift problems (Wu et al., 2019; Fujimoto et al., 2019; Kumar et al., 2019; Nair et al., 2020; Wang et al., 2020; Peng et al., 2019). As the episodes also contain the dynamics information of the original environment, it is possible to use such a dataset to directly learn an abstraction of the environment in the offline RL setting. To this end, we introduce Value Memory Graph (VMG), a graph-structured world model for offline reinforcement learning tasks. VMG is a Markov decision process (MDP) defined on a graph as an abstraction of the original environment. Instead of directly applying RL methods to the offline dataset collected in the original environment, we first learn and build VMG and use it as a simplified substitute for the environment on which to apply RL methods.

[Figure 1: Demonstration of a successful episode where a robot trained on the dataset "kitchen-partial" accomplishes 4 subtasks in sequence guided by VMG. Vertex values are shown via color shade. By searching for graph actions that lead to the high-value future region (darker blue) computed by value iteration on the graph, VMG controls the robot arm to maximize episode rewards and finish the task.]

VMG is built by mapping offline episodes to directed chains in a metric space trained via contrastive learning. These chains are then connected into a graph via state merging. Vertices and directed edges of VMG are viewed as graph states and graph actions, respectively. Each vertex transition on VMG has a reward derived from the original rewards in the environment. To control agents in environments, we first run the classical value iteration algorithm (Puterman, 2014) once on VMG to calculate graph state values. Thanks to VMG's discrete and relatively small state and action spaces, this can be done in less than one second without training a value neural network. At each timestep, VMG is used to search for graph actions that lead to high-value future states. Graph actions are directed edges and cannot be directly executed in the original environment; with the help of an action translator trained via supervised learning (e.g., Emmons et al. (2021)) on the same offline dataset, the searched graph actions are converted into environment actions that control the agent. An overview of our method is shown in Fig. 1. Our contributions can be summarized as follows:

• We present Value Memory Graph (VMG), a graph-structured world model for the offline reinforcement learning setting. VMG represents the original environment as a graph-based MDP with relatively small and discrete action and state spaces.

• We design a method to learn and build VMG from an offline dataset via contrastive learning and state merging.

• We introduce a VMG-based method to control agents by reasoning about graph actions that lead to high-value future states via value iteration and converting them to environment actions via an action translator.
• Experiments on the D4RL benchmark show that VMG can outperform several state-of-the-art offline RL methods on several goal-oriented tasks with sparse rewards and long temporal horizons.

2. RELATED WORK

Offline Reinforcement Learning One crucial problem in offline RL is how to avoid out-of-the-training-distribution (OOD) actions and states that decrease performance in test environments (Fujimoto et al., 2019; Kumar et al., 2019; Levine et al., 2020). Some approaches use hierarchical policies to control agents, with a high-level policy that generates commands such as abstract actions or skills and a low-level policy that converts them into concrete environment actions. Our method can be viewed as a hierarchical RL approach with a VMG-based high-level policy and a low-level action translator. Compared to previous methods, which learn high-level policies in the environment, our high-level policy is instead obtained on VMG via value iteration without additional neural network training.

Model-based Reinforcement Learning

Recent research in model-based reinforcement learning (MBRL) has shown a significant advantage in sample efficiency over model-free reinforcement learning (Ha & Schmidhuber, 2018; Janner et al., 2019; Hafner et al., 2019; Schrittwieser et al., 2020; Ye et al., 2021). In most previous methods, world models are designed to approximate the original environment transition. In contrast, VMG abstracts the environment as a simple graph-based MDP. Therefore, we can apply RL methods directly to VMG for simple and fast policy learning. As we demonstrate later in our experiments, this facilitates reasoning and leads to good performance in tasks with long temporal horizons and sparse rewards.

Representation Learning Contrastive learning methods learn a good representation by maximizing the similarity between related data and minimizing the similarity of unrelated data in the learned representation space (Oord et al., 2018; Chen et al., 2020; Radford et al., 2021). Bisimulation-based methods like Zhang et al. (2020) learn a representation with the help of bisimulation metrics (Ferns & Precup, 2014; Ferns et al., 2011; Bertsekas & Tsitsiklis, 1995), which measure the "behavior similarity" of states w.r.t. future reward sequences given any input action sequence. In VMG, we use a contrastive learning loss to learn a metric space encoding the similarity between states as L2 distance.

3. VALUE MEMORY GRAPH (VMG)

Our world model, Value Memory Graph (VMG), is a graph-structured Markov decision process constructed as a simplified version of the original environment with discrete and relatively small state-action spaces. RL methods can be applied to VMG instead of the original environment to lower the difficulty of policy learning. To build VMG, we first learn a metric space that measures reachability among environment states. Then, a graph is built in the metric space from the dataset as the backbone of VMG. Finally, a Markov decision process is defined on the graph as an abstract representation of the environment.

3.1. VMG METRIC SPACE LEARNING

VMG is built in a metric space where the L2 distance represents whether one state can be reached from another within a few timesteps. The embedding in the metric space is learned with the contrastive mechanism demonstrated in Fig. 2a. We train two neural networks: a state encoder $Enc_s : s \to f_s$ that maps the original state $s$ to a state feature $f_s$ in the metric space, and an action encoder $Enc_a : (f_s, a) \to \Delta f_{s,a}$ that maps the original action $a$ to a transition $\Delta f_{s,a}$ in the metric space conditioned on the current state feature $f_s$. Given a transition triple $(s, a, s')$, we add the transition $\Delta f_{s,a}$ to the state feature $f_s$ to predict the next state feature $\hat{f}_{s'} = f_s + \Delta f_{s,a}$. The prediction is encouraged to be close to the ground truth $f_{s'}$ and away from other, unrelated state features. Therefore, we use the following learning objective to train $Enc_s$ and $Enc_a$:

$$L_c = D^2(\hat{f}_{s'}, f_{s'}) + \frac{1}{N} \sum_{n=1}^{N} \max\left(m - D^2(\hat{f}_{s'}, f_{s_{neg,n}}),\, 0\right) \quad (1)$$

Here, $D(\cdot, \cdot)$ denotes the L2 distance and $s_{neg,n}$ denotes the $n$-th negative state. Given a batch of transition triples $(s_i, a_i, s'_i)$ randomly sampled from the training set and a fixed margin distance $m$, we use all the other next states $s'_{j | j \neq i}$ as the negative states for $s_i$ and encourage $\hat{f}_{s'_i}$ to stay at least $m$ away from negative states in the metric space.

In addition, we use an action decoder $Dec_a : (f_s, \Delta f_{s,a}) \to \tilde{a}$ to reconstruct the action from the transition $\Delta f_{s,a}$ conditioned on the state feature $f_s$, as shown in Fig. 2b. This conditional auto-encoder structure encourages the transition $\Delta f_{s,a}$ to be a meaningful representation of the action. Besides, we penalize the length of the transition when it exceeds the margin $m$ to encourage adjacent states to be close in the metric space. Therefore, we have the additional action loss:

$$L_a = D^2(\tilde{a}, a) + \max\left(\|\Delta f_{s,a}\|_2 - m,\, 0\right) \quad (2)$$

The total training loss for metric learning, $L_{metric}$, is the sum of the contrastive and action losses:

$$L_{metric} = L_c + L_a \quad (3)$$
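As a concrete illustration, the two losses above can be computed from already-encoded features. The sketch below is a minimal numpy version of Eq. 1 and the length penalty of Eq. 2, assuming the encoder outputs are given as arrays; the function name `metric_losses` is ours, not from the paper:

```python
import numpy as np

def metric_losses(f_s, delta, f_next, f_neg, m=1.0):
    """Sketch of the metric-learning losses (Eq. 1 and the penalty in Eq. 2).

    f_s    : (d,) current state feature Enc_s(s)
    delta  : (d,) predicted transition Enc_a(f_s, a)
    f_next : (d,) true next state feature Enc_s(s')
    f_neg  : (N, d) negative next-state features from the batch
    m      : margin distance
    """
    pred = f_s + delta                         # predicted next-state feature
    pos = np.sum((pred - f_next) ** 2)         # pull prediction toward ground truth
    d_neg = np.sum((pred - f_neg) ** 2, axis=1)
    neg = np.maximum(m - d_neg, 0.0).mean()    # push prediction >= m from negatives
    # length penalty on the transition, from Eq. 2
    length_pen = max(np.linalg.norm(delta) - m, 0.0)
    return pos + neg, length_pen
```

The full $L_a$ would add the action-reconstruction term $D^2(\tilde{a}, a)$ from the decoder, omitted here since it only involves a second network's output.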

3.2. CONSTRUCT THE GRAPH IN VMG

To construct the graph in VMG, we first map all the episodes in the training data to the metric space as directed chains. These episode chains are then combined into a graph with a reduced number of state features by merging similar state features into one vertex based on their distance in the metric space. The overall algorithm is visualized in Fig. 3a, with pseudocode in Appx. B.1. Given a distance threshold γ_m, a vertex set V, and a state s_i to check, we test whether the minimal metric-space distance from the existing vertices to s_i is smaller than γ_m. If not, or if the vertex set is empty, we set s_i as a new vertex v_J and add it to V. This process is repeated over the whole dataset. After the vertex set V is constructed, each state s_i can be classified to a vertex v_j whose metric-space distance to s_i is smaller than γ_m.

[Figure 3: (a) Three episodes are mapped as three chains in the metric space, colored differently. Nodes that are close to each other are merged, combining the chains into a directed graph. (b) The graph reward R_G(v_j1, v_j2) of the action from the green vertex v_j1 to the blue vertex v_j2 is defined as the average over rewards in the original episodes.]

In the training set, each state transition (s_i, a_i, s'_i) represents a directed connection from s_i to s'_i. We therefore create the directed edges of the graph from the original transitions: for any two different vertices v_j1, v_j2 in V, if there exists a transition (s_i, a_i, s'_i) where s_i and s'_i are classified to v_j1 and v_j2, respectively, we add a directed edge e_j1→j2 from v_j1 to v_j2.
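The merging procedure can be sketched in a few lines of numpy. This is a simplified, assumption-laden version (vertex features are the raw state features, and `classify` picks the nearest vertex without checking the γ_m bound), not the paper's exact implementation:

```python
import numpy as np

def build_graph(features, next_features, gamma_m):
    """Sketch of VMG graph construction by state merging (Sec. 3.2).

    features, next_features : (N, d) arrays of Enc_s(s_i) and Enc_s(s'_i)
    gamma_m : distance threshold controlling the vertex "radius"
    Returns vertex features, a nearest-vertex classifier, and the edge set.
    """
    vertices = []                  # list of (d,) vertex features
    for f in features:
        # a state becomes a new vertex if no existing vertex is within gamma_m
        if not vertices or min(np.linalg.norm(f - v) for v in vertices) > gamma_m:
            vertices.append(f)
    V = np.stack(vertices)

    def classify(f):               # index of the nearest vertex
        return int(np.argmin(np.linalg.norm(V - f, axis=1)))

    edges = set()
    for f, f_next in zip(features, next_features):
        j1, j2 = classify(f), classify(f_next)
        if j1 != j2:
            edges.add((j1, j2))    # directed edge from the original transition
    return V, classify, edges
```

Note that the result depends on the order in which states are visited, which is also true of the greedy merging in Alg. 1.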

3.3. DEFINE A GRAPH-BASED MDP

VMG is a Markov decision process (MDP) $(S_G, A_G, P_G, R_G)$ defined on the graph, where $S_G$, $A_G$, $P_G$, and $R_G$ denote the state set, the action set, the state transition probability, and the reward function of this new graph MDP, respectively. Each vertex is viewed as a graph state, and each directed edge $e_{j_1 \to j_2}$ starting from a vertex $v_{j_1}$ is viewed as a graph action available in $v_{j_1}$. Therefore, the graph state set $S_G$ equals the graph vertex set $V$, and the graph action set $A_G$ equals the edge set $E$. The graph state transition probability $P_G$ from $v_{j_1}$ to $v_{j_2}$ is 1 if the corresponding edge exists in $E$ and 0 otherwise:

$$P_G(v_{j_2} \mid v_{j_1}, e_{j_1 \to j_2}) = \begin{cases} 1 & \text{if } e_{j_1 \to j_2} \in E \\ 0 & \text{otherwise} \end{cases} \quad (4)$$

We define the graph reward of each possible state transition $e_{j_1 \to j_2}$ as the average over the original rewards from $v_{j_1}$ to $v_{j_2}$ in the training set $D$, plus "internal rewards". Internal rewards come from the original transitions that fall inside $v_{j_1}$ or $v_{j_2}$ after state merging. An example of the graph reward definition is visualized in Fig. 3b. Concretely,

$$R_{j_1 \to j_2} = \operatorname{avg}\{\, r_i \mid \forall s_i \text{ classified to } v_{j_1},\ s'_i \text{ classified to } v_{j_2},\ (s_i, a_i, r_i, s'_i) \in D \,\} \quad (5)$$

$$R_G(v_{j_1}, v_{j_2}) = \begin{cases} \tfrac{1}{2} R_{j_1 \to j_1} + R_{j_1 \to j_2} + \tfrac{1}{2} R_{j_2 \to j_2} & \text{if } e_{j_1 \to j_2} \in E \\ \text{not defined} & \text{otherwise} \end{cases} \quad (6)$$

Note that the rewards of graph transitions outside of $E$ are not defined, as these transitions cannot happen according to Eq. 4. For internal rewards, where both the source $s_i$ and the target $s'_i$ of an original transition $(s_i, s'_i)$ are classified to the same vertex, we split the reward in half and allocate it to both the incoming and outgoing edges; this appears as $\tfrac{1}{2} R_{j_1 \to j_1}$ and $\tfrac{1}{2} R_{j_2 \to j_2}$ in Eq. 6. We now have a well-defined MDP on the graph, which serves as our world model VMG.
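Eq. 5 and Eq. 6 amount to grouping environment rewards by vertex pair and averaging. A small self-contained sketch (function names are ours; `classify` stands for the vertex assignment from Sec. 3.2):

```python
from collections import defaultdict

def graph_rewards(transitions, classify, edges):
    """Sketch of the graph reward definition (Eq. 5-6).

    transitions : iterable of (s_feat, r, s_next_feat) with environment reward r
    classify    : maps a state feature to its vertex index
    edges       : set of directed edges (j1, j2) with j1 != j2
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for f, r, f_next in transitions:
        key = (classify(f), classify(f_next))   # includes internal (j, j) pairs
        sums[key] += r
        counts[key] += 1
    R = {k: sums[k] / counts[k] for k in sums}  # Eq. 5: average reward per pair

    def avg(key):                               # missing pairs contribute 0
        return R.get(key, 0.0)

    # Eq. 6: each vertex's internal reward is halved and added to the edge
    return {(j1, j2): 0.5 * avg((j1, j1)) + avg((j1, j2)) + 0.5 * avg((j2, j2))
            for (j1, j2) in edges}
```

Treating an absent internal pair as a zero reward is our simplifying assumption for the sketch.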

3.4. HOW TO USE VMG

VMG, together with an action translator, generates environment actions that control agents to maximize episode returns. We first run the classical value iteration algorithm (Puterman, 2014) on VMG to compute the value V(v_j) of each graph state v_j. This can be done in under one second without learning an additional neural-network-based value function, thanks to VMG's finite and discrete state-action spaces. To guide the agent, VMG provides a graph action that leads to high-value future graph states at each time step. Due to the distribution shift between the offline dataset and the environment, there can be gaps between VMG and the environment, so the graph action that value iteration rates as optimal on VMG might not be optimal in the environment. We notice that instead of greedily selecting the graph action with the highest next-state value, first searching for a good future state multiple steps ahead and then planning a path to it gives more reliable performance. Given the current environment state s_c, we first find the closest graph state v_c on VMG. Starting from v_c, we search N_s future steps for the future graph state v* with the highest value. Then, we plan a shortest path P = [v_c, v_c+1, ..., v*] from v_c to v* via Dijkstra's algorithm (Dijkstra, 1959) on the graph. We select the N_sg-th graph state v_c+Nsg on the path and take the edge e_c→c+Nsg as the searched graph action. The graph action e_c→c+Nsg is converted to the environment action a_c via an action translator: a_c = Tran(s_c, v_c+Nsg). Pseudocode can be found in Appx. B.2. The action translator Tran(s, s') infers the environment action to execute given the current state s and a state s' in the near future. Tran(s, s') is trained purely on the offline dataset via supervised learning, separately from the training of VMG. In detail, given an episode from the training set and a time step t, we first randomly sample a step t + k from the future K steps,
k ∼ Uniform(1, K). Then, Tran(s, s') is trained to regress the action a_t at step t given the state s_t and the future state s_t+k using the L2 regression loss L_Tran = D²(Tran(s_t, s_t+k), a_t). Note that when k = 1, p(a_t | s_t, s_t+k) is determined purely by the environment dynamics, and Tran(s, s') becomes an inverse dynamics model. As k increases, the influence of the behavior policy that collected the offline dataset on p(a_t | s_t, s_t+k) grows. Therefore, the sample range K should be small, to reflect the environment dynamics and reduce the influence of the behavior policy. In all of our experiments, K is set to 10.
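The two graph-side computations described above, value iteration on the deterministic graph MDP and reward-weighted shortest-path planning, can be sketched as follows. `value_iteration` and `plan_path` are illustrative names, the graph is assumed to be given as an edge-reward dictionary, and the edge weights follow the definition in Appx. B (maximal graph reward minus the edge's reward):

```python
import heapq

def value_iteration(n, edges, R_G, gamma=0.95, iters=500):
    """Value iteration on the graph MDP. Since P_G is deterministic (Eq. 4),
    the Bellman backup is a max over a state's outgoing edges."""
    succ = {j: [] for j in range(n)}
    for j1, j2 in edges:
        succ[j1].append(j2)
    V = [0.0] * n
    for _ in range(iters):
        V = [max((R_G[(j, k)] + gamma * V[k] for k in succ[j]), default=0.0)
             for j in range(n)]
    return V

def plan_path(src, dst, R_G):
    """Dijkstra from src to dst with weight w = max(R_G) - R_G(edge), so the
    returned path is both short and high-rewarded."""
    r_max = max(R_G.values())
    adj = {}
    for (j1, j2), r in R_G.items():
        adj.setdefault(j1, []).append((j2, r_max - r))
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, j = heapq.heappop(heap)
        if j == dst:
            break
        if d > dist.get(j, float("inf")):
            continue                       # stale queue entry
        for k, w in adj.get(j, []):
            if d + w < dist.get(k, float("inf")):
                dist[k], prev[k] = d + w, j
                heapq.heappush(heap, (d + w, k))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

On a graph with a few thousand vertices both routines run in well under a second, consistent with the timing claims above.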

4. EXPERIMENTS

4.1. PERFORMANCE ON OFFLINE RL BENCHMARKS

Test Benchmark We evaluate VMG on the widely used offline reinforcement learning benchmark D4RL (Fu et al., 2020). In detail, we test VMG on three domains: Kitchen, AntMaze, and Adroit. In Kitchen, a robot arm in a virtual kitchen needs to finish four subtasks in an episode and receives a sparse reward after finishing each subtask. D4RL provides three datasets in Kitchen: kitchen-complete, kitchen-partial, and kitchen-mixed. In AntMaze, a robot ant needs to go through a maze and reach a target location, receiving a sparse reward only when it reaches the target. D4RL provides three mazes of different sizes, each with two datasets. In Adroit, policies control a robot hand to finish tasks such as rotating a pen or opening a door, with dense rewards. For evaluation, D4RL normalizes the performance on every task to a range of 0-100, where 100 represents the performance of an "expert" policy. More benchmark details can be found in D4RL (Fu et al., 2020) and Appx. A.

Baselines We mainly compare our method with two state-of-the-art methods, CQL (Kumar et al., 2020) and IQL (Kostrikov et al., 2021b), on all the above-mentioned datasets. Both CQL and IQL are based on Q-learning with constraints on the Q function to alleviate the OOD action issue in the offline setting. In addition, we report the performance of BRAC-p (Wu et al., 2019), BEAR (Kumar et al., 2019), DT (Chen et al., 2021), and AWAC (Nair et al., 2020) on the datasets they used. Performance of behavior cloning (BC) is from Kostrikov et al. (2021b).

Hyperparameters In all the experiments, the dimension of the metric space is set to 10. The margin m in Eq. 1 and 2 is 1. The distance threshold γ_m is set to 0.5, 0.8, and 0.3 in Kitchen, AntMaze, and Adroit, respectively. Hyperparameters are selected from 12 configurations.
We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10^-3, train the model for 800 epochs with batch size 100, and select the best-performing checkpoint. More details about hyperparameters and experiment settings are in Appx. D.

Performance Experimental results are shown in Tab. 1. VMG's scores are averaged over three individually trained models and over 100 individually evaluated episodes in the environment. In general, VMG outperforms the baseline methods in Kitchen and AntMaze and shows competitive performance in Adroit. Note that good reasoning ability is crucial in the Kitchen and AntMaze domains, as rewards in both are sparse and the agent needs to plan over a long horizon before receiving any reward signal. In AntMaze, the baseline methods perform relatively well in the smallest maze, 'umaze', which requires fewer than 200 steps to solve. In the 'large' maze, where episodes can be longer than 600 steps, the performance of the baselines drops dramatically. VMG keeps a reasonable score in all three mazes, which suggests that simplifying the environment to a graph-structured MDP helps RL methods reason better over long horizons in the original environment. Adroit, with its high-dimensional action space, is the most challenging D4RL domain for all methods; VMG still shows competitive performance there compared to the baselines. These experiments show that learning a policy directly on VMG helps agents perform well, especially in environments with sparse rewards and long temporal horizons.

4.2. UNDERSTANDING VALUE MEMORY GRAPH

To analyze whether VMG can understand and represent the structure of the task space correctly, we visualize an environment, the corresponding VMG, and their relationship in Fig. 4. We study the task "antmaze-large-diverse" shown in Fig. 4a; the visualization suggests that VMG learns a meaningful representation of the task. Another VMG visualization, for the more complicated task "pen-human" (where the blue pen is rotated to the same orientation as the green one), is shown in Fig. 5, and more visualizations can be found in Appx. J.

In offline RL, policies are trained on an offline dataset to master skills that maximize accumulated returns. When the dataset contains other skills that do not lead to high rewards, these skills are simply ignored; we call them ignored skills. We can retrain a new policy to master ignored skills by redefining the reward function accordingly. However, rerunning RL methods with new reward functions is cumbersome in the original complex environment, as the policy network and Q/value networks must be retrained from scratch. In contrast, rerunning value iteration on VMG with a new reward function takes less than one second without retraining any neural networks. Note that the learning of VMG and the action translator is reward-free. Therefore, we do not need to retrain VMG; we only recalculate the graph rewards using Eq. 6 with the new reward function.

4.3. REUSABILITY OF VMG WITH NEW REWARD FUNCTIONS

We design an experiment on the dataset "kitchen-partial" to verify the reusability of VMG with new reward functions. In this dataset, the robot only receives rewards for the following four subtasks: opening a microwave, moving a kettle, turning on a light, and opening a sliding cabinet. Besides, there are training episodes containing ignored skills such as turning on a burner or opening a hinged cabinet. We first train a model on the original dataset. Then, we define a new reward function in which only the ignored skills have positive rewards and relabel the training episodes accordingly. After that, we recalculate the graph rewards using Eq. 6, rerun value iteration on VMG, and test our agent. Experimental results in Tab. 2 show that agents can perform the ignored skills after rerunning value iteration on the original VMG with recalculated graph rewards, without retraining any neural networks.

4.4. ABLATION STUDY

Distance Threshold The distance threshold γ_m directly controls the "radius" of vertices and thus the size of the graph. We demonstrate how γ_m affects performance on the task "kitchen-partial" in Fig. 6 (the "kitchen-partial" dataset contains 137k transitions). A larger γ_m reduces the number of vertices but hurts performance due to information loss. More results can be found in Appx. E.

Graph Reward Here we study how different designs of the graph reward R_G(v_j1, v_j2) affect final performance. In addition to the original version defined in Eq. 5 and Eq. 6, which averages over the environment rewards, we try maximization and summation, denoted R_G,max and R_G,sum, respectively. We also study the effectiveness of the internal reward through the following three variants of Eq. 6 (for e_j1→j2 ∈ E): R_G,rm = R_j1→j2, R_G,rm,h = R_j1→j2 + ½R_j2→j2, and R_G,rm,t = ½R_j1→j1 + R_j1→j2. Experimental results in Tab. 3 suggest that the original design of the graph reward represents the environment well and leads to the best performance.

Effectiveness of Action Decoder The action decoder is trained to reconstruct the original action from the transition in the metric space, conditioned on the state feature, as shown in Fig. 2b. Its training encourages transitions in the metric space to better represent actions and thus leads to a better metric space. To show its effectiveness, we train a VMG variant without the action decoder and report the results in Tab. 4. Performance without the action decoder drops in all three tested tasks, especially in 'kitchen-partial' (from 68.8 to 15.4). The results verify our design choice.
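The internal-reward variants ablated above differ only in which halved internal terms of Eq. 6 they keep. A small sketch makes the four choices explicit (names and the zero-default for missing reward entries are our assumptions):

```python
def edge_reward_variant(R, j1, j2, variant="full"):
    """Internal-reward variants from the ablation (Tab. 3).

    R : dict mapping (a, b) -> average environment reward (Eq. 5), where
        (j, j) entries hold internal rewards; missing entries count as 0.
    """
    def g(key):
        return R.get(key, 0.0)

    if variant == "full":   # Eq. 6: both internal rewards, halved
        return 0.5 * g((j1, j1)) + g((j1, j2)) + 0.5 * g((j2, j2))
    if variant == "rm":     # R_G,rm: drop internal rewards entirely
        return g((j1, j2))
    if variant == "rm_h":   # R_G,rm,h: keep only the head vertex's internal reward
        return g((j1, j2)) + 0.5 * g((j2, j2))
    if variant == "rm_t":   # R_G,rm,t: keep only the tail vertex's internal reward
        return 0.5 * g((j1, j1)) + g((j1, j2))
    raise ValueError(variant)
```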

Multi-Step Search

In Tab. 4, we also list the performance of our method without multi-step search. Compared to the original version, we observe a performance drop in 'kitchen-partial' and 'antmaze-medium-play' and similar performance in 'pen-human'. This suggests that instead of greedily searching one step ahead in the value iteration results, first searching multiple steps ahead for a high-value state in the long-term future and then planning a path to it via Dijkstra helps agents perform better. We think the advantage might be caused by the gap between VMG and the environment: a path on VMG that value iteration rates as optimal may not still be optimal in the environment, while a shorter path from Dijkstra reduces cumulative errors and uncertainty and thus increases the reliability of the policy.

Limitations As an attempt to apply graph-structured world models to offline reinforcement learning, VMG still has some limitations. For example, VMG does not learn to generate new edges in the graph but only creates edges from existing transitions in the dataset, which might be a limitation when not enough data is available. In addition, VMG is designed for the offline setting; moving to the online setting requires further designs for environment exploration and dynamic graph expansion, which can be interesting future work. Besides, the action translator is trained via conditioned behavior cloning, which may lead to suboptimal results in tasks with important low-level dynamics such as gym locomotion (see Appx. H). Training the action translator with offline RL methods may alleviate this issue.

5. CONCLUSION

We present Value Memory Graph (VMG), a graph-structured world model for offline reinforcement learning. VMG is a Markov decision process defined on a directed graph learned from the offline dataset as an abstract version of the environment. As VMG is a smaller, discrete substitute for the original environment, RL methods such as value iteration can be applied to VMG instead of the original environment to lower the difficulty of policy learning. Experiments on the widely used offline RL benchmark D4RL show that VMG can outperform baselines on many goal-oriented tasks, especially when environments have sparse rewards and long temporal horizons. We believe VMG shows a promising direction for improving RL performance by abstracting the original environment and hope it encourages more future work.

A ENVIRONMENT DETAILS

The datasets in D4RL (Fu et al., 2020) are released under the CC BY license, and the related code under the Apache 2.0 License. We use the latest versions of the datasets (v1/v0/v1 for AntMaze, Kitchen, and Adroit, respectively). Different versions of the datasets contain exactly the same training transitions; the newer versions fix some bugs in the metadata, such as wrong termination steps. Performance is measured by returns normalized to the range between 0 and 100 as defined by the D4RL benchmark (Fu et al., 2020):

normalized score = 100 × (score − random score) / (expert score − random score)

A score of 100 corresponds to the average return of a domain-specific expert. For AntMaze and Kitchen, an estimate of the maximum possible score is used as the expert score. For Adroit, it is estimated from a policy trained with behavioral cloning on human demonstrations and fine-tuned online with RL in the environment. For more details about the datasets, please refer to D4RL (Fu et al., 2020).
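The normalization is a single linear rescaling; a minimal sketch (the example scores below are hypothetical, not from any D4RL task):

```python
def normalized_score(score, random_score, expert_score):
    """D4RL normalization: 0 matches a random policy, 100 an expert policy."""
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Hypothetical example: raw return 150 on a task where random scores 50
# and the expert scores 250 maps to a normalized score of 50.
print(normalized_score(150.0, 50.0, 250.0))
```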

B ALGORITHMS B.1 GRAPH CONSTRUCTION

The detailed algorithm of graph construction is shown in Alg. 1.

Details of Dijkstra When we use Dijkstra in Sec. 3.4 to plan a path P from v_c to v*, we assign a weight to each edge to make sure P is both short and high-rewarded. The weights are based on rewards: for each edge e_j1→j2, we define the edge weight w_j1→j2 as the gap between the maximal graph reward and the edge's reward, and denote the weight set as W:

w_j1→j2 = max{R_G(v_j3, v_j4) | ∀e_j3→j4 ∈ E} − R_G(v_j1, v_j2)

Algorithm 1: Graph Construction
Input: Training set D = {(s_i, a_i, r_i, s'_i) | i = 1, 2, ..., N}, empty vertex set V = {}, current vertex index J = 1, distance threshold γ_m, empty edge set E = {}
1  for (s_i, a_i, r_i, s'_i) in D do
2      f_si = Enc_s(s_i)
3      Compute the distance d_ij between f_si and f_vj for every f_vj in V
4      if min{d_ij | f_vj in V} > γ_m or J = 1 then
5          v_J ← s_i, f_vJ ← f_si
6          V.append((v_J, f_vJ))
7          J ← J + 1
8      end
9  end
10 for (s_i, a_i, r_i, s'_i) in D do
11     Find v_j1, v_j2 that s_i and s'_i are classified to in V, respectively
12     if v_j1 ≠ v_j2 then E.append(e_j1→j2)
13 end

C ARCHITECTURE OF NEURAL NETWORKS

For all the networks including the state encoder Enc s , the action encoder Enc a , the action decoder Dec a , and the action translator T ran(s, s ′ ), we use a 3-layer MLP with hidden size 256 and ReLU activation functions. 

D EXPERIMENT SETTINGS AND HYPERPARAMETERS

Our model is trained on a single RTX Titan GPU in about 1.5 hours. For inference, building the graph via clustering takes about 0.5-2 minutes before evaluation; after that, evaluating 100 episodes takes about 0.5-10 minutes. We implement VMG on top of the offline RL Python package d3rlpy (Takuma Seno, 2021), released under the MIT license. In all the experiments, we use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10^-3 and a batch size of 100. Each model is trained for 800 epochs; we save a checkpoint every 50 epochs and report the performance of the best checkpoint, evaluated in the environment, among those saved from the 500th to the 800th epoch. The remaining hyperparameter settings can be found in Tab. 5, where N_s = ∞ means we search future steps to the end of the graph. For the Kitchen domain, hyperparameters are tuned on "kitchen-partial"; for AntMaze, on "antmaze-umaze-diverse"; for Adroit, they are tuned individually. We use the environment to tune the hyperparameters. We searched four hyperparameters in our main experiments: γ_m in [0.3, 0.5, 0.8, 1.0, 1.2], reward discount in [0.8, 0.95], N_sg in [1, 2, 3], and N_s in [12, ∞]. Hyperparameters were searched one by one, 12 configurations in total. For hyperparameters such as batch size or learning rate, we follow the defaults of the RL library d3rlpy. The dimension of the metric space is set to 10 in all experiments. Tuning hyperparameters offline is an ongoing and important research topic in offline RL, and we leave it for future work. More experimental results for the distance threshold γ_m in the tasks "antmaze-medium-play" and "pen-cloned" can be found in Fig. 7; the results suggest that the model is not very sensitive to γ_m as long as it is not too large.

Standard clustering algorithms such as BIRCH (Zhang et al., 1996) are slow in this case (up to hours for BIRCH on machines with Intel Xeon Gold 6242).
Therefore, we compare with classical K-means (Lloyd, 1982) as implemented in the Faiss library (Johnson et al., 2019) with GPU support, in both the AntMaze and Kitchen domains. Faiss-based K-means finishes within about 20 seconds in our setting; our merging method takes up to 1 minute. Experimental results are shown in Tab. 6: the VMG created by our merging method performs better than the one created by K-means. The vertices of our method can be viewed as hyperspheres in the metric space with the same radius γ_m. In contrast, K-means cannot directly specify the size of each cluster, which can result in vertices with different "volumes" in the metric space; this might introduce undesired distortion into the graph and reduce performance. As K-means has no parameter to control cluster size directly, we have to search for the best number of clusters for every dataset. The numbers of clusters used for K-means are shown in Tab. 7.

E.3 INFLUENCE OF m

The value of the margin m in Eq.1 and Eq.2 implicitly defines the minimal distance between negative state pairs in the learned metric space. To study the influence of m on performance, we set m to 0.5, 1, and 2 in the datasets antmaze-medium-play, kitchen-partial, and pen-human and show the results in Tab.8. In the AntMaze experiment, performance improves with a larger m, whereas in pen-human a smaller m gives better results. In kitchen-partial, m=1 performs best. These results suggest that m=1 is a reasonable default for the tasks we evaluate on, and that tuning m separately for each task could improve performance further.

Table 8: Influence of the margin m.
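Since Eq.1 and Eq.2 are not reproduced in this section, the role of m can be illustrated with a generic hinge-style contrastive objective, in which positive pairs are pulled together and negative pairs are pushed at least m apart. This is a sketch of the general mechanism; the exact form of the paper's losses may differ:

```python
import numpy as np

def contrastive_margin_loss(f_anchor, f_pos, f_neg, m=1.0):
    """Generic hinge-style contrastive loss sketch.
    Positive pairs are pulled to distance 0; negative pairs incur a penalty
    only while their distance is below the margin m, so m lower-bounds the
    separation of negative pairs in the learned metric space."""
    d_pos = np.linalg.norm(f_anchor - f_pos, axis=-1)
    d_neg = np.linalg.norm(f_anchor - f_neg, axis=-1)
    return float(np.mean(d_pos + np.maximum(m - d_neg, 0.0)))
```

Under this form, a larger m spreads negatives further apart, which matches the trade-off observed in Tab.8: more separation helps in AntMaze but can hurt in pen-human.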

E.4 INFLUENCE OF DISCOUNT FACTOR

Here we study how different discount factor values affect the performance of VMG. We set the discount factor to 0.8, 0.95, and 0.99. Experiments in Tab.9 show that 0.99 leads to better performance in pen-human and comparable performance in kitchen-partial. In antmaze-medium-play, 0.99 performs worse than 0.8 and 0.95, which suggests that a small discount factor in AntMaze might help reduce cumulative errors.
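For reference, the place where the discount factor enters is the value iteration run on the graph MDP: V(v) = max over successors v' of [R_G(v, v') + γ · V(v')]. The sketch below uses hypothetical dictionary-based edge and reward maps; VMG's exact iteration schedule may differ:

```python
def value_iteration(edges, rewards, gamma=0.95, n_iter=200):
    """Value iteration on a deterministic graph MDP.
    edges:   dict vertex -> list of successor vertices
    rewards: dict (v, v') -> graph reward R_G(v, v')
    Returns vertex values V(v) = max_{v'} [R_G(v, v') + gamma * V(v')]."""
    V = {v: 0.0 for v in edges}
    for _ in range(n_iter):
        for v, succs in edges.items():
            if succs:  # terminal vertices keep value 0
                V[v] = max(rewards[(v, s)] + gamma * V[s] for s in succs)
    return V
```

Because each backup multiplies by γ, a smaller discount factor geometrically damps the contribution of distant rewards, and with it any value-estimation error accumulated along long paths, which is consistent with the AntMaze observation above.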

E.5 INFLUENCE OF THE DIMENSION OF THE METRIC SPACE

Here we study how the dimension of the metric space affects the performance of VMG. We set the metric space dimension to 5, 10, and 20. Experiments in Tab.10 show that models with latent space dimensions 10 and 20 perform better than those with dimension 5, which suggests that reasonable performance requires a latent space with enough dimensions to represent states and actions well. Moreover, the 10-dimensional space works better than the 20-dimensional one in kitchen-partial but worse in pen-human, suggesting there is room for improvement if the dimension is tuned individually per task.

Table 10: Influence of the metric space dimension.

E.6 INFLUENCE OF K

The hyperparameter K used in training the action translator in Sec.3.4 defines the range of future states the action translator conditions on during training. To study the influence of K, we show experiments with K=5, 10, and 20 in Tab.11. We notice that K=5 does not work in antmaze-medium and kitchen-partial, which suggests that K=5 is not large enough to cover 2 steps of graph transitions. In addition, K=20 gives better results than K=10 in kitchen-partial and pen-human, while K=10 performs best in antmaze-medium-play. These results suggest that a sufficiently large K helps the model perform better.

F TRAINING CURVE

Fig. 8 shows the training curves of the contrastive loss L c, the action loss L a, and the action translator loss L Tran in the tasks kitchen-partial, antmaze-medium-play, and pen-human.
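The role of K in conditioned behavior cloning can be sketched as a sampling rule over training episodes: pick a time step t and a future state within a K-step window, then train Tran(s_t, s_{t+k}) to predict a_t. The uniform sampling of k below is an illustrative assumption, not necessarily the paper's exact rule:

```python
import random

def sample_translator_pair(episode, K):
    """Sample one training tuple for the action translator.
    episode: list of (state, action) pairs in temporal order.
    Returns (s_t, goal, a_t) where the goal is a state 1..K steps ahead,
    so K bounds how far into the future the translator is conditioned."""
    t = random.randrange(len(episode) - 1)
    k = random.randint(1, min(K, len(episode) - 1 - t))  # clamp at episode end
    s_t, a_t = episode[t]
    goal = episode[t + k][0]
    return s_t, goal, a_t
```

A K that is too small cannot reach states as far ahead as one graph transition spans (a graph edge typically covers several environment steps), which would explain the failure of K=5 in antmaze-medium and kitchen-partial.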

H EXPERIMENTS IN GYM LOCOMOTION TASKS

VMG is introduced to help agents reason about the long-term future, so as to improve their performance in complex environments with sparse rewards and large search spaces caused by long temporal horizons and continuous state/action spaces. VMG may not help in gym locomotion tasks, as these tasks do not require agents to reason about the long-term future and are thus outside our scope. Gym locomotion tasks provide rich, dense reward signals, and the motion patterns to be learned are periodic and short. Therefore, the problems VMG is designed to solve are not an issue here. Our performance on these tasks is expected to be close to behavior cloning, since the low-level component, the action translator, is trained via (conditioned) behavior cloning. The action translator handles the local dynamics that are not modeled in VMG. We run new experiments on these tasks and show the results below; the experimental results verify our assumption. The results and analysis suggest that an improved design and/or learning strategy for the action translator might improve performance, for example, training the action translator with conditioned offline RL methods instead of conditioned behavior cloning. However, this is orthogonal to VMG's contribution to future reasoning, and we leave it for future work.

I FUTURE WORKS

There are several directions in which to improve VMG. Building hierarchical graphs to model different levels of environment structure might help represent the environment better. For example, if a robot needs to cook a meal, a high-level graph could represent abstract tasks such as washing vegetables, cutting vegetables, etc., while a low-level graph guides a goal-conditioned policy. This might improve high-level task planning. Extending VMG to the online setting is also an important future step. In online reinforcement learning, data with new information is collected throughout the training stage, so the graph needs a mechanism to continually expand and incorporate new information. Besides, exploration is a crucial component of online reinforcement learning. If we model the uncertainty of the graph, VMG can guide the agent toward regions of high uncertainty to explore more effectively. Combining VMG with Monte Carlo tree search might also help the policy explore and exploit better.

J VISUALIZATION OF VMG

More visualizations of VMG in different tasks are shown in Fig. 9, 10, 11, and 12. An episode is denoted as a green path on the graph with a "+" sign at the end.



Recent works such as Wu et al. (2019); Fujimoto et al. (2019); Kumar et al. (2019); Nair et al. (2020); Wang et al. (2020); Peng et al. (2019) directly penalize the mismatch between the trained policy and the behavior policy via an explicit density model or via a divergence. Other methods, such as Kumar et al. (2020); Kostrikov et al. (2021a;b), constrain training by penalizing the Q function. Model-based reinforcement learning methods such as Yu et al. (2020; 2021); Kidambi et al. (2020) constrain the policy to the region of the world model that is close to the training data. Compared to previous methods, graph actions in VMG always move agents toward graph states, and all graph states come from the training dataset; agents therefore naturally stay close to states from the training distribution. Hierarchical Reinforcement Learning Hierarchical RL methods (e.g., Savinov et al. (2018); Nachum et al. (2018); Eysenbach et al. (2019); Huang et al. (2019); Liu et al. (2020); Mandlekar et al. (2020); Yang et al. (2020); Emmons et al. (2020); Zhang et al. (

Figure 2: The training pipeline of the state encoder Enc s and the action encoder Enc a to build the memory map. Enc s converts original states s into points in the memory map. Enc a maps actions a as transitions in the memory map.

Figure 3: Create a graph and define rewards in VMG. In Fig.3a, three episodes are mapped as three chains in the metric space, colored differently. We merge nodes that are close to each other and combine these chains into a directed graph. In Fig.3b, the graph reward R G (v j1, v j2) of the action from the green vertex v j1 to the blue vertex v j2 is defined as the average over rewards in the original episodes.

Figure 4: An example of VMG learned from the dataset 'antmaze-large-diverse'. Fig.4a shows the environment with the target location highlighted by a red circle. VMG is visualized via UMAP in Fig.4b. Graph state values are represented by color shades with higher values in darker blue. Graph states that are close to the target have high values calculated by value iteration. In Fig.4c, graph states are mapped to the corresponding maze locations to show the relationship.

Figure 5: VMG and a successful trial in the task "pen-human". The blue pen is rotated to the same orientation as the green one.

Figure 6: Influence of γ m in "kitchen-partial" in performance and VMG size.

Algorithm 1 (fragment): if the connection e j1→j2 ∉ E then E.append(e j1→j2). The procedure of policy execution is shown in Alg.2.

Algorithm 2: Policy Execution
Input: current state s c, state encoder Enc s, action translator Tran, vertex and edge sets of VMG (V, E), vertex values V, edge weights W
1: f sc = Enc s(s c)
2: v c = argmin over v j with (v j, f j) ∈ V of D(f sc, f j)
3: Search a future horizon of N s steps starting from v c and select the highest-value vertex v*
4: Compute the weighted shortest path P from v c to v* via Dijkstra: P = [v c, v c+1, ..., v*]
5: a c = Tran(s c, v c+N sg)
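Step 4 of the policy execution, the weighted shortest path on the graph, can be sketched with a standard-library Dijkstra over dictionary-based edge and weight maps. The data layout is an assumption for illustration; VMG's actual graph representation may differ:

```python
import heapq

def dijkstra_path(edges, weights, src, dst):
    """Weighted shortest path via Dijkstra's algorithm.
    edges:   dict vertex -> list of successor vertices
    weights: dict (v, v') -> non-negative edge weight
    Returns the vertex sequence [src, ..., dst]."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == dst:
            break
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for s in edges.get(v, []):
            nd = d + weights[(v, s)]
            if nd < dist.get(s, float("inf")):
                dist[s], prev[s] = nd, v
                heapq.heappush(heap, (nd, s))
    # backtrack from dst to src
    path, v = [dst], dst
    while v != src:
        v = prev[v]
        path.append(v)
    return path[::-1]
```

Given the returned path P, step 5 would condition the action translator on the vertex N sg steps ahead, i.e. something like a_c = Tran(s_c, P[N_sg]).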

Figure 7: More results of the influence of γ m in performance and VMG size

Figure 8: Training curve of the contrastive loss L c , action loss L a , and action translator loss L T ran .

Figure 9: Visualization of VMG in different tasks

Experimental results on domains Kitchen, AntMaze, and Adroit from D4RL benchmark. VMG outperforms baselines in Kitchen and AntMaze where only sparse rewards are provided and achieves comparable performance in Adroit. Results and the standard deviation are calculated over three trained models.

VMG success rate of ignored skills. Agents can perform these skills by rerunning value iteration with the new reward function in a trained VMG.

Ablation study of graph reward design in VMG. The original design gives us the best performance.

Ablation study of contrastive loss, action decoder, and Dijkstra search.


Detailed Hyperparameter Setting

Ablation study of different state merging methods. Our original design gives us better performance.

Number of clusters used in K-means

Influence of the discount factor.

Average number of environment transitions per graph transition.

Performance of VMG in gym locomotion tasks. The performance of VMG is expected to be close to behavior cloning.

ACKNOWLEDGMENTS

We would like to thank Ahmed Hefny and Vaneet Aggarwal for their helpful feedback and discussions on this work.

CODE AVAILABILITY

Code is available at https://github.com/TsuTikgiau/


