ON THE IMPORTANCE OF THE POLICY STRUCTURE IN OFFLINE REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Offline reinforcement learning (RL) has recently attracted a great deal of attention as an approach that utilizes past experience to learn a policy. Recent studies have reported the challenges of offline RL, such as estimating the values of actions that are out of the data distribution. To mitigate these issues, we propose an algorithm that leverages a mixture of deterministic policies. In our framework, the state-action space is divided by learning discrete latent variables, and sub-policies corresponding to each region are trained. The proposed algorithm, which we call Value-Weighted Variational Auto-Encoder (V2AE), is derived by considering the variational lower bound of the offline RL objective function. The aim of this work is to shed light on the importance of the policy structure in offline RL. We show empirically that the use of the proposed mixture policy can reduce the accumulation of the critic loss in offline RL, which was reported in previous studies. Experimental results also indicate that introducing the policy structure improves the performance on tasks from the D4RL benchmark datasets.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) has had remarkable success in a variety of applications. Many of its successes have been achieved in online learning settings where the RL agent interacts with the environment during the learning process. However, such interactions are often time consuming and computationally expensive. The desirability of reducing the number of interactions in RL has motivated an active interest in offline RL (Levine et al., 2020), also known as batch RL (Lange et al., 2012). In offline RL, the goal is to learn the optimal policy from a prepared dataset collected through an arbitrary and unknown process. Prior work on offline RL has focused on how to avoid estimating the Q-values of actions that are out of the data distribution (Fujimoto et al., 2019; Fujimoto & Gu, 2021). While previous studies often address this issue through regularization of the critics (Kumar et al., 2020; An et al., 2021; Kostrikov et al., 2021; 2022), we propose to mitigate the issue from the perspective of the policy structure. Our hypothesis is that evaluation of out-of-distribution actions can be avoided by dividing the state-action space, which is potentially achieved by learning discrete latent variables of the state-action space. When the data distribution is multimodal, as shown in Figure 1(a), fitting a policy modeled with a unimodal distribution such as a Gaussian may lead to interpolation between separate modes, which results in the value estimation of actions that are out of the data distribution (Figure 1(b)). To avoid this, we employ a mixture of deterministic policies (Figure 1(c)). We divide the state-action space and learn sub-policies for each region. Ideally, this approach enables us to avoid interpolating between separate modes of the data distribution.
In this study, we propose to train a mixture policy by learning discrete latent representations, which can be interpreted as dividing the state-action space and learning sub-policies that correspond to each region. We derive the proposed algorithm by considering the variational lower bound of the offline RL objective function, and we refer to it as the Value-Weighted Variational Auto-Encoder (V2AE). The main contribution of this study is an offline RL algorithm that trains a mixture policy by learning discrete latent variables. We also propose a regularization technique for a mixture policy based on mutual information and empirically show that it improves the performance of the proposed algorithm. A previous study (Brandfonbrener et al., 2021) reported the accumulation of the critic loss during the training phase, which was attributed to generating out-of-distribution actions. We show empirically that the use of the proposed mixture policy can reduce the accumulation of this approximation error in offline RL. In experiments with benchmark tasks in D4RL (Fu et al., 2020), the proposed algorithm proved competitive with popular offline RL methods. While the experimental results show promising performance, our aim is to shed light on the importance of the policy structure as an inductive bias in offline RL, rather than to claim state-of-the-art performance.

Figure 1: Schematic illustration of the proposed approach. (a) In offline RL, the distribution of samples is often multimodal; samples with high Q-values may lie in separate regions with no datapoints between them. (b) Fitting a unimodal distribution to such samples can lead to estimating actions out of the data distribution. (c) In the proposed approach, a discrete latent variable of the state-action space is learned, and a deterministic policy is fitted for each region.

2. RELATED WORK

Recent studies have shown that regularization is a crucial component of offline RL (Fujimoto et al., 2019; Kumar et al., 2020; Levine et al., 2020; Kostrikov et al., 2021). For example, Kostrikov et al. (2021) proposed a regularization based on the Fisher divergence, and Fujimoto & Gu (2021) showed that simply adding a behavior cloning term to the objective function of TD3 can achieve state-of-the-art performance on D4RL benchmark tasks (Fu et al., 2020). Other research has investigated the structure of the critic, proposing the use of an ensemble of critics (An et al., 2021) or offering a one-step offline RL approach (Brandfonbrener et al., 2021; Goo & Niekum, 2021). Previous studies (Fujimoto et al., 2019; Fujimoto & Gu, 2021) have indicated that the source of the value approximation error is "extrapolation error" that occurs when the value of state-action pairs not contained in a given dataset is estimated. Our hypothesis is that such extrapolation error can be mitigated by dividing the state-action space, which is potentially achieved by learning discrete latent variables. We investigate the effect of incorporating the policy structure as an inductive bias in offline RL, which has not been fully investigated. Learning a discrete latent variable in the context of RL is closely related to a mixture policy, where a policy is represented as a combination of a finite number of sub-policies. In a mixture policy, one of the sub-policies is activated for a given state, and the module that determines which sub-policy to use is often called the gating policy (Daniel et al., 2016). Because of this two-layered structure, a mixture policy is also called a hierarchical policy (Daniel et al., 2016). Although we do not consider temporal abstraction in this study, we note that a well-known hierarchical RL framework with temporal abstraction is the option-critic (Bacon et al., 2017).
Since we consider policies without temporal abstraction, we use the term "mixture policy," following the terminology in Wulfmeier et al. (2021) . Previous studies have demonstrated the advantages of mixture policies in online RL (Osa et al., 2019; Zhang & Whiteson, 2019; Wulfmeier et al., 2020; 2021; Akrour et al., 2021) . In these existing methods, sub-policies are often trained to cover separate modes of the Q-function, which is similar to our idea. While existing methods have leveraged the latent variable in offline RL (Zhou et al., 2020; Chen et al., 2021b; 2022) , the latent variable is continuous in these methods. As indicated by studies on latent representations (Kingma & Welling, 2014; Dupont, 2018; Brown et al., 2020) , we think that the use of the discrete latent variable should be investigated in offline RL.

3. PROBLEM FORMULATION

Reinforcement Learning. Consider a reinforcement learning problem under a Markov decision process (MDP) defined by a tuple (S, A, P, r, γ, d), where S is the state space, A is the action space, P(s_{t+1}|s_t, a_t) is the transition probability density, r(s, a) is the reward function, γ is the discount factor, and d(s_0) is the probability density of the initial state. A policy π(a|s) : S × A → R is defined as the conditional probability density over actions given states. The goal of RL is to identify a policy that maximizes the expected return E[R_0 | π], where the return is the discounted sum of rewards over time, R_t = Σ_{k=t}^{T} γ^{k−t} r(s_k, a_k). The Q-function Q^π(s, a) is defined as the expected return when starting from state s, taking action a, and thereafter following policy π under the given MDP (Sutton & Barto, 2018). In offline RL, it is assumed that the learning agent is given a fixed dataset D = {(s_i, a_i, r_i)}_{i=1}^{N} consisting of states, actions, and rewards collected by an unknown behavior policy. The goal of offline RL is to obtain a policy that maximizes the expected return using D, without online interactions with the environment during the learning process.
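As a concrete check of the return definition above, the following minimal Python sketch computes R_0 = Σ_{k=0}^{T} γ^k r_k for a short reward sequence; the function name and the example rewards are illustrative, not part of the paper's implementation.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R_t for t = 0: the discounted sum over the episode, sum_k gamma^k * r_k."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# 1 + 0.5 * 2 + 0.25 * 4 = 3.0
print(discounted_return([1.0, 2.0, 4.0], 0.5))
```

The same helper evaluated from any later step t simply drops the first t rewards, since the discount is relative to t.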

Objective function

We formulate the problem of offline RL as follows. Given a dataset D = {(s_i, a_i, r_i)}_{i=1}^{N} obtained through interactions between a behavior policy β(a|s) and the environment, our goal is to obtain a policy π that maximizes the expected return. In the process of training a policy in offline RL, the expected return is evaluated with respect to the states stored in the given dataset. Thus, the objective function is given by

J(π) = E_{s∼D, a∼π}[f^π(s, a)],   (1)

where f^π is a function that quantifies the performance of policy π. As indicated in Schulman et al. (2016), there are several choices for f^π in RL. TD3 employs the action-value function, f^π(s, a) = Q^π(s, a), and A2C employs the advantage function, f^π(s, a) = A^π(s, a) (Mnih et al., 2016). Other previous studies employ shaping with the exponential function, f^π(s, a) = exp(Q^π(s, a)) (Peters & Schaal, 2007) or f^π(s, a) = exp(A^π(s, a)) (Neumann & Peters, 2008; Wang et al., 2018). Without loss of generality, we assume that the objective function is given by equation 1, and we derive the proposed algorithm by considering the lower bound of this objective.

Mixture policy

In this study, we consider a mixture policy given by

π(a|s) = Σ_{z∈Z} π_gate(z|s) π_sub(a|s, z),   (2)

where z is a discrete latent variable, π_gate(z|s) is the gating policy that determines the value of the latent variable, and π_sub(a|s, z) is the sub-policy that determines the action for given s and z. We assume that the sub-policy π_sub(a|s, z) is deterministic: it determines the action for given s and z as a = μ_θ(s, z), where μ_θ(s, z) is parameterized with a vector θ. Additionally, we assume that the gating policy π_gate(z|s) determines the latent variable as

z = arg max_{z′} Q_w(s, μ_θ(s, z′)),   (3)

where Q_w(s, a) is the estimated Q-function parameterized with a vector w. This gating policy is applicable to objective functions such as f^π(s, a) = exp(Q^π(s, a)), f^π(s, a) = A^π(s, a), and f^π(s, a) = exp(A^π(s, a)); please refer to Appendix A for details.
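The gating rule in equation 3 can be sketched in a few lines of numpy. Here the linear sub-policies and the quadratic critic are hypothetical stand-ins, not the paper's networks; the sketch only illustrates enumerating the discrete latent and acting with the maximizing sub-policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_latents, state_dim, action_dim = 4, 3, 2

# Hypothetical deterministic sub-policies: one linear map mu(s, z) per latent z.
W = rng.normal(size=(n_latents, action_dim, state_dim))
def mu(s, z):
    return W[z] @ s

# Hypothetical critic Q_w(s, a); any callable would do here.
def q_value(s, a):
    return float(-np.sum(a ** 2) + np.dot(s[:action_dim], a))

def act(s):
    """Gating policy: pick z maximizing Q_w(s, mu(s, z)), then act deterministically."""
    z = max(range(n_latents), key=lambda z_: q_value(s, mu(s, z_)))
    return z, mu(s, z)

s = rng.normal(size=state_dim)
z_star, a = act(s)
```

Because z is discrete, the arg max is computed exactly by enumeration; no sampling or relaxation is needed at action-selection time.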

4. POLICY UPDATE VIA THE VARIATIONAL LOWER BOUND

We consider a training procedure based on policy iteration (Sutton & Barto, 2018), where the critic and the policy are iteratively improved. In this section, we describe the policy update procedure in the proposed method. To derive the update rule for the policy parameter θ, we first consider the lower bound of log J(π) for the objective function in equation 1. We assume that f^π(s, a) in equation 1 is approximated with f^π_w(s, a), which is parameterized with a vector w. In a manner similar to Dayan & Hinton (1997) and Kober & Peters (2011), when f^π_w(s, a) > 0 for any s and a, we can determine the lower bound of log J(π) using Jensen's inequality, treating the normalized product d^β(s)β(a|s)f^π_w(s, a) as a probability density:

log J(π) ≈ log ∫∫ d^β(s) π_θ(a|s) f^π_w(s, a) ds da   (4)
= log ∫∫ d^β(s) β(a|s) (π_θ(a|s) / β(a|s)) f^π_w(s, a) ds da
≥ ∫∫ d^β(s) β(a|s) f^π_w(s, a) log (π_θ(a|s) / β(a|s)) ds da   (5)
= E_{(s,a)∼D}[f^π_w(s, a) log π_θ(a|s)] − E_{(s,a)∼D}[f^π_w(s, a) log β(a|s)],   (6)

where β(a|s) is the behavior policy used to collect the given dataset, and d^β(s) is the stationary distribution over states induced by executing β(a|s). The second term in equation 6 is independent of the policy parameter θ. Thus, we can maximize the lower bound of log J(π) by maximizing Σ_{i=1}^{N} f^π_w(s_i, a_i) log π_θ(a_i|s_i). When we employ f^π(s, a) = exp(A^π(s, a)) and the policy is Gaussian, the resulting algorithm is equivalent to AWAC (Nair et al., 2020). To employ a mixture policy with a discrete latent variable, we further analyze the objective function in equation 6. As in Kingma & Welling (2014) and Sohn et al. (2015), we can obtain a variant of the variational lower bound of the conditional log-likelihood:

log π_θ(a_i|s_i) ≥ −D_KL(q_φ(z|s_i, a_i) || p(z|s_i)) + E_{z∼q_φ(z|s_i,a_i)}[log π_θ(a_i|s_i, z)] = ℓ_cvae(s_i, a_i; θ, φ),   (7)

where q_φ(z|s, a) is the approximate posterior distribution parameterized with a vector φ, and p(z|s) is the conditional prior over the latent variable.
The derivation of equation 7 is provided in Appendix B. Although it is often assumed in prior studies (Fujimoto et al., 2019) that z is statistically independent of s, i.e., p(z|s) = p(z), in our framework p(z|s) should represent the behavior of the gating policy π_gate(z|s). Recognizing the difficulty of representing the gating policy in equation 3 exactly, we approximate it with the softmax distribution

p(z|s) = exp(Q_w(s, μ_θ(s, z))) / Σ_{z′∈Z} exp(Q_w(s, μ_θ(s, z′))).   (8)

Since we employ double-clipped Q-learning as in Fujimoto et al. (2018), we compute Q_w(s, μ_θ(s, z)) = min_{j=1,2} Q_{w_j}(s, μ_θ(s, z)) in our implementation. The second term in equation 7 is approximated as the mean squared error, as in the standard implementation of the VAE. Based on equations 6 and 7, we obtain the objective function for training a mixture policy:

L_ML(θ, φ) = Σ_{i=1}^{N} f^π(s_i, a_i) ℓ_cvae(s_i, a_i; θ, φ).   (9)

This objective can be regarded as a weighted maximum likelihood (Kober & Peters, 2011) for a mixture policy. It can also be viewed as reconstructing state-action pairs with adaptive weights, as in Peters & Schaal (2007) and Nair et al. (2020). Therefore, the policy samples actions within the support of the data and does not evaluate out-of-distribution actions. The primary difference between the proposed method and the existing methods (Peters & Schaal, 2007; Nair et al., 2020) is that the use of a mixture of policies conditioned on discrete latent variables can be regarded as dividing the state-action space. For example, in AWAC (Nair et al., 2020), a unimodal policy is used to reconstruct all of the "good" actions in the given dataset. However, in offline RL, the given dataset may contain samples collected by diverse behaviors, and forcing a unimodal policy to cover all the modes in the dataset can degrade the resulting performance.
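The softmax prior over the discrete latent in equation 8, combined with the double-clipped minimum over the two critics, is straightforward to compute. The sketch below assumes the per-latent critic values have already been evaluated; the numeric values are made up for illustration.

```python
import numpy as np

def softmax_prior(q_values):
    """p(z|s) proportional to exp(Q_w(s, mu(s, z))); max subtracted for stability."""
    logits = q_values - np.max(q_values)
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical critic values Q_{w_j}(s, mu(s, z)) for |Z| = 3 latents, two critics.
q1 = np.array([1.0, 3.0, 2.0])
q2 = np.array([1.5, 2.5, 4.0])

# Double-clipped Q: elementwise min over the two critics before the softmax.
p_z = softmax_prior(np.minimum(q1, q2))
```

Subtracting the maximum before exponentiating leaves the distribution unchanged but avoids overflow, a standard trick when Q-values are large.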
In our approach, the sub-policy π_θ(a|s, z) is encouraged to mimic the state-action pairs assigned to the same value of z, without mimicking the actions assigned to different values of z. In addition, we propose a regularization technique for a mixture policy based on the mutual information (MI) between z and the state-action pair (s, a), which we denote by I(z; s, a). As shown by Barber & Agakov (2003), a variational lower bound of I(z; s, a) is given, up to the constant entropy H(z), by E[log g_ψ(z|s, a)], where g_ψ(z|s, a) is an auxiliary distribution that approximates the posterior distribution p(z|s, a). Thus, the final objective function is given by

L(θ, φ, ψ) = L_ML(θ, φ) + λ Σ_{i=1}^{N} E_{z∼p(z)}[log g_ψ(z|s_i, μ_θ(s_i, z))].   (10)

The MI-based regularization in the second term of equation 10 encourages the diversity of the behaviors encoded in the sub-policies π(a|s, z). We empirically show that this regularization improves the performance of the proposed method in Section 7.

Algorithm 1: Value-Weighted Variational Auto-Encoder (V2AE)
  Initialize the actor μ_θ, the critics Q_{w_j} for j = 1, 2, and the posterior q_φ(z|s, a)
  for t = 1 to T do
    Sample a minibatch {(s_i, a_i, s′_i, r_i)}_{i=1}^{M} from D
    for each element (s_i, a_i, s′_i, r_i) do
      Compute the value of the latent variable for s′_i: z′ = arg max_{z̃} Q_w(s′_i, μ_{θ′}(s′_i, z̃))
      Compute the target value: y_i = r_i + γ min_{j=1,2} Q_{w_j}(s′_i, μ_{θ′}(s′_i, z′))
    end for
    Update the critics by minimizing Σ_{i=1}^{M} (y_i − Q_{w_j}(s_i, a_i))^2 for j = 1, 2
    if t mod d_interval = 0 then
      Update the actor and the posterior by maximizing equation 9
      (optionally) Update the actor by maximizing Σ_{i=1}^{M} E_{z∼p(z)}[log g_ψ(z|s_i, μ_θ(s_i, z))]
    end if
  end for

5. TRAINING THE CRITIC FOR A MIXTURE OF DETERMINISTIC POLICIES

To derive the objective function for training the critic for the mixture of deterministic policies with the gating policy in equation 3, we consider the following operator:

T_z Q(s, a) = r(s, a) + γ E_{s′}[max_{z′} Q(s′, μ(s′, z′))].   (11)

We refer to T_z as the latent-max-Q operator. Following Ghasemipour et al. (2021), we prove the following theorems.

Theorem 5.1. In the tabular setting, T_z is a contraction operator in the L∞ norm. Hence, with repeated applications of T_z, any initial Q-function converges to a unique fixed point.

The proof of Theorem 5.1 is provided in Appendix C.

Theorem 5.2. Let Q_z denote the unique fixed point achieved in Theorem 5.1, and let π_z denote the policy that chooses the latent variable as z = arg max_{z′} Q_z(s, μ(s, z′)) and outputs the action μ(s, z) in a deterministic manner. Then Q_z is the Q-value function corresponding to π_z.

Proof. (Theorem 5.2) Rewriting equation 11 with z′ = arg max_{z̃} Q_z(s′, μ(s′, z̃)), we obtain

T_z Q_z = r(s, a) + γ E_{s′} E_{a′∼π_z}[Q_z(s′, a′)].   (12)

Since by definition Q_z is the unique fixed point of T_z, we have our result.

These theorems show that the latent-max-Q operator T_z retains the contraction and fixed-point-existence properties. Based on these results, we estimate the Q-function by applying the latent-max-Q operator. In our implementation, we employ double-clipped Q-learning (Fujimoto et al., 2018). Thus, given a dataset D, the critic is trained by minimizing

L(w_j) = Σ_{(s_i, a_i, s′_i, r_i)∈D} (Q_{w_j}(s_i, a_i) − y_i)^2   (13)

for j = 1, 2, where the target value y_i is computed as

y_i = r_i + γ max_{z′∈Z} min_{j=1,2} Q_{w_j}(s′_i, μ_{θ′}(s′_i, z′)).
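The target computation for the critic, combining the max over latents with the double-clipped min over critics, can be sketched on a batch of precomputed target-critic values. The array shapes and numbers below are hypothetical, chosen only to make the min/max ordering explicit.

```python
import numpy as np

gamma = 0.99

# Hypothetical target-critic values Q_{w_j}(s', mu_{theta'}(s', z)) for a batch:
# shape (batch, n_critics, n_latents).
q_next = np.array([[[1.0, 2.0],
                    [1.5, 1.8]],
                   [[0.5, 0.2],
                    [0.4, 0.3]]])
rewards = np.array([1.0, -1.0])

# y_i = r_i + gamma * max_z min_j Q_{w_j}(s'_i, mu(s'_i, z))
clipped = q_next.min(axis=1)                      # min over the two critics
targets = rewards + gamma * clipped.max(axis=1)   # max over the latents
```

Note the order of operations: the pessimistic min over critics is taken per latent first, and only then is the best latent selected, matching the target in equation 13.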

6. PRACTICAL IMPLEMENTATION

The proposed Value-Weighted Variational Auto-Encoder (V2AE) algorithm is summarized in Algorithm 1. As in TD3 (Fujimoto et al., 2018), the actor is updated once after every d_interval updates of the critics; in our implementation, we set d_interval = 2. For modeling the discrete latent variable, we use the Gumbel-softmax trick (Jang et al., 2017; Maddison et al., 2017). We also use the state normalization used in TD3+BC (Fujimoto & Gu, 2021). In preliminary experiments, we found that when f^π(s, a) = exp(b A^π(s, a)) in equation 9, the scaling factor b has non-trivial effects on performance and its best value differs across tasks. To avoid tuning the scaling parameter for each task, we normalize the advantage function as

f^π(s, a) = exp( α (A^π(s, a) − max_{(s̃,ã)∈D_batch} A^π(s̃, ã)) / (max_{(s̃,ã)∈D_batch} A^π(s̃, ã) − min_{(s̃,ã)∈D_batch} A^π(s̃, ã)) ),

where D_batch is a mini-batch sampled from the given dataset D and α is a constant; we set α = 10 for the MuJoCo tasks in our experiments. For other hyperparameter details, please refer to Appendix L.
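The per-minibatch advantage normalization above maps the best advantage in the batch to a weight of 1 and everything else into (0, 1]. A minimal sketch, with a made-up advantage vector:

```python
import numpy as np

def advantage_weights(adv, alpha=10.0):
    """f(s,a) = exp(alpha * (A - max A) / (max A - min A)), computed per minibatch."""
    span = adv.max() - adv.min()
    return np.exp(alpha * (adv - adv.max()) / span)

adv = np.array([-1.0, 0.0, 3.0])
w = advantage_weights(adv)
```

Because the numerator A − max A is nonpositive, every weight lies in (0, 1] regardless of the scale of the advantages, which is what removes the need for a per-task scaling factor b. (A guard for span = 0, when all advantages in the batch are equal, would be needed in practice.)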

7. EXPERIMENTS

We investigated the effect of the policy structure on the resulting performance and the training error of the critics. In the first experiment, we performed a comparative assessment of TD3+BC (Fujimoto & Gu, 2021), AWAC (Nair et al., 2020), and V2AE on a toy problem in which the distribution of samples in the given dataset is multimodal. We also conducted a quantitative comparison between the proposed methods and baseline methods on the D4RL benchmark tasks (Fu et al., 2020). In the following, we refer to the proposed method based on the objective in equation 9 as V2AE and the variant with the MI-based regularization in equation 10 as infoV2AE. In both the toy problem and the D4RL tasks, we used the author-provided implementation of TD3+BC; our implementations of V2AE and AWAC are based on it.

7.1. MULTIMODAL DATA DISTRIBUTION ON TOY TASK

To show the effect of a multimodal data distribution in a given dataset, we evaluated the performance of TD3+BC, AWAC, and V2AE on the toy task shown in Figure 2. The differences between the compared methods are summarized in Table 1. In our implementations of AWAC and V2AE, we used the state normalization and double-clipped Q-learning as in TD3+BC, as well as the normalization of the advantage function described in Section 6. The difference between AWAC and V2AE thus isolates the effect of the policy representation. In this toy task, the agent is represented as a point mass, the state is the position of the point mass in two-dimensional space, and the action is a small displacement of the point mass. There are two goals, indicated by red circles in Figure 2; a blue circle indicates the start position, and there are three obstacles, indicated by black solid circles. The reward is sparse: when the agent reaches one of the goals, it receives a reward of 1.0 and the episode ends; if the agent makes contact with one of the obstacles, it receives a reward of -1.0 and the episode ends. The given dataset provides trajectories to both goals, with no information about which goal the agent is heading toward.

Table 2: Performance on the toy task.
  TD3+BC: -0.2 ± 0.4
  AWAC:   0.33 ± 0.44
  V2AE:   1.0 ± 0.0

The scores are summarized in Table 2. Among the evaluated methods, only V2AE successfully solved the task. The policy obtained by TD3+BC did not reach the goal in a stable manner, as shown in Figure 2(b). Similarly, as shown in Figure 2(c), the policy learned by AWAC often slows down around the point (0, 0) and fails to reach the goal; this behavior implies that AWAC attempted to average over multiple modes of the distribution. By contrast, the policy learned by V2AE successfully reaches one of the goals.
As the main difference between AWAC and V2AE is the policy architecture, this result shows that a unimodal policy distribution fails to deal with the multimodal data distribution, while the mixture policy employed in V2AE deals with it successfully. The activation of the sub-policies is visualized in Figure 2(e). The color indicates the value of the discrete latent variable given by the gating policy, z* = arg max_z Q_w(s, μ(s, z)). Figure 2(d) shows that different sub-policies are activated in different regions, which indicates that V2AE appropriately divided the state-action space.

7.2. D4RL BENCHMARK TASKS

We evaluated the performance of the proposed method on the D4RL benchmark tasks. As baseline methods, we used TD3+BC, CQL (Kumar et al., 2020), AWAC, and easyBCQ (Brandfonbrener et al., 2021). EasyBCQ is a one-step RL version of Batch-Constrained Q-learning (Fujimoto et al., 2019). All the baseline methods use double-clipped Q-learning for the critic in this experiment; for easyBCQ, state normalization and double-clipped Q-learning are used in our implementation. The implementations of AWAC and V2AE were identical to those in the previous experiment, so the difference between AWAC and V2AE again isolates the effect of the policy representation. In our evaluation, |Z| = 8 was used for V2AE and infoV2AE; the effect of the dimensionality of the discrete latent variable is shown in Appendix D. In this study, we evaluated the baseline methods on the antmaze-v0, Kitchen, Adroit, and mujoco-v2 tasks. For completeness, we provide results on the mujoco-v0 datasets in Appendix E.

Performance scores. Comparisons with the baseline methods are shown in Tables 3 and 4. Bold text indicates the best performance; underlined text indicates the tasks for which V2AE outperformed AWAC. On mujoco-v2 tasks, the performance of V2AE is comparable or superior to that of the state-of-the-art methods. Regarding the comparison between V2AE and AWAC, the performance of V2AE matched or exceeded that of AWAC except on the Hopper-expert and Hopper-medium-expert tasks. This result confirms that the use of a mixture policy is beneficial for these tasks, although the benefit is likely task-dependent. Notably, our implementation of AWAC showed significantly better performance than that reported by Nair et al. (2020) because we employed double-clipped Q-learning and state normalization; a comparison between the original results of AWAC and ours is provided in Appendix F.
In addition, infoV2AE, which employs the MI-based regularization, outperformed V2AE on various tasks and showed the best performance in terms of the sum of the overall scores on mujoco-v2. This result shows that encouraging the diversity of the sub-policies via the proposed MI-based regularization is effective for V2AE. The advantage of V2AE and infoV2AE over the baseline methods is most apparent on the Antmaze tasks, which involve long horizons and require "stitching" together sub-trajectories in a given dataset (Fu et al., 2020). TD3+BC, CQL, easyBCQ, and AWAC did not work well on the Antmaze tasks, which indicates that the techniques used in these algorithms are not effective for such tasks. As the difference between V2AE and AWAC isolates the effect of the policy representation, the results indicate that the use of the mixture policy improves the ability to deal with tasks that require stitching together sub-trajectories in a given dataset. A comparison with additional baseline methods is provided in Appendix G, and the effect of the scaling parameter α in V2AE and infoV2AE is reported in Appendix H.

Critic loss function

To investigate the effect of the policy structure on the critic loss, we compare the value of the critic loss function between AWAC and V2AE. In addition, to see the difference between discrete and continuous latent variables, we also evaluated a variant of V2AE with continuous latent variables, which we refer to as cV2AE. The normalized scores and the value of the critic loss during training are shown in Figure 3; the critic loss given by equation 13 is plotted every 5,000 updates. The shaded area indicating the error bar of the critic loss is omitted for cV2AE on walker2d-medium-expert-v2 because the critic loss of cV2AE exploded. Previous work indicated that the value of the critic loss can accumulate over iterations (Brandfonbrener et al., 2021), and Figure 3 shows this accumulation for AWAC on the mujoco-v2 tasks. Importantly, in V2AE the value of the critic loss is clearly lower and the performance of the policy is better than in AWAC. Brandfonbrener et al. (2021) showed that the accumulation of the critic loss can be reduced by introducing regularization; since the difference between AWAC and V2AE is the policy representation, our results indicate that the use of the mixture policy can also mitigate the accumulation of the critic loss in offline RL. This result suggests the importance of incorporating inductive bias in the policy structure. However, it is worth noting that a reduction of the critic loss given by equation 13 does not necessarily lead to improved policy performance. On halfcheetah-medium-expert-v2, although the critic loss was significantly lower in V2AE than in AWAC, there was no significant difference in performance between the two. Recently, Fujimoto et al. (2022) likewise indicated that a lower value of the critic loss given by equation 13 does not necessarily lead to better performance, which fits with what we observed in our experiments. Regarding the use of the continuous latent variable, while cV2AE often achieves lower values of the critic loss than AWAC, the critic loss of cV2AE occasionally explodes. A possible explanation is that a continuous latent variable adds flexibility to the policy representation, but separate modes of the objective function can still be interpolated in the learned latent space, which results in generating out-of-distribution samples. The performance of cV2AE on the mujoco-v2 tasks is reported in Appendix I, and more detailed results on the critic loss are provided in Appendix J.

Qualitative evaluation of the learned latent variable. The activation of sub-policies in V2AE on the pen-human-v0 task is shown in Figure 4. The top row shows the state at the 20th, 40th, 60th, and 80th time steps; the graphs in the middle row show the action-values of each of the sub-policies at each state, Q_w(s, μ(s, z)). A previous study on the option framework (Smith et al., 2018) reported that in the existing method only a few options are activated and the remaining options do not learn meaningful behaviors. In contrast, the results in Figure 4 show that the value of each sub-policy Q_w(s, μ(s, z)) changes over time and that various sub-policies are activated during execution. A figure showing the activation of sub-policies in different episodes is provided in Appendix K.

8. CONCLUSION

We have proposed an algorithm, which we call Value-Weighted Variational Auto-Encoder (V2AE), for training a mixture policy in offline RL. The V2AE algorithm can be interpreted as an approach that divides the state-action space by learning the discrete latent variable and learns the corresponding sub-policies in each region. Our experimental results show that the use of the mixture policy can mitigate the issue of critic error accumulation in offline RL. In addition, the experimental results also indicate that the use of the mixture policy significantly improves the performance of an offline RL algorithm. We believe that our work represents an important step toward leveraging the policy structure in offline RL.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of the results, we will provide the code by posting a comment directed to the reviewers and area chairs with a link to an anonymous repository after the discussion forums are opened. We also summarized how to implement the proposed method and its variants in Section 6; more detailed descriptions of the implementation can be found in Appendix L. We used the open-sourced benchmark tasks in D4RL to ensure reproducibility.

A APPLICABILITY OF THE GATING POLICY

In the proposed algorithm, we employ a gating policy that determines the value of the latent variable as

z = arg max_{z′} Q_w(s, μ_θ(s, z′)),

where μ_θ(s, z′) represents the deterministic sub-policy and Q_w(s, a) is the approximated Q-function. While this gating policy may appear specific to the case where Q^π(s, a) is maximized, it is also applicable to other objective functions such as A^π(s, a), exp(Q^π(s, a)), and exp(A^π(s, a)). The advantage function is defined as A^π(s, a) = Q^π(s, a) − V^π(s). As the state-value function V^π(s) is independent of the action, we obtain

arg max_a Q^π(s, a) = arg max_a (Q^π(s, a) − V^π(s)) = arg max_a A^π(s, a).

Thus, we can rewrite the gating policy as

z = arg max_{z′} Q_w(s, μ_θ(s, z′)) = arg max_{z′} A_w(s, μ_θ(s, z′)).

Similarly, because exp(·) is a monotonically increasing function, a maximizer of Q^π(s, a) is also a maximizer of exp(Q^π(s, a)). Thus, we can also rewrite the gating policy as

z = arg max_{z′} Q_w(s, μ_θ(s, z′)) = arg max_{z′} exp(Q_w(s, μ_θ(s, z′))) = arg max_{z′} A_w(s, μ_θ(s, z′)) = arg max_{z′} exp(A_w(s, μ_θ(s, z′))).

As the latent variable is discrete, we can compute arg max_{z′} Q_w(s, μ_θ(s, z′)) exactly by enumeration. Since we use this gating policy, the gating policy is deterministic in our implementation.
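The arg max invariance above can be verified numerically in a few lines: subtracting an action-independent value V(s) and applying a monotone transform like exp leave the maximizing latent unchanged. The Q-values here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=8)   # hypothetical Q_w(s, mu(s, z)) for each of 8 latents
v = 0.7                  # a state value V(s), independent of z
adv = q - v              # A_w(s, mu(s, z))

# arg max is invariant to subtracting V(s) and to monotone transforms such as exp.
same = (q.argmax() == adv.argmax() == np.exp(q).argmax() == np.exp(adv).argmax())
print(same)  # True
```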

B DERIVATION OF THE VARIATIONAL LOWER BOUND

We employed the variational lower bound in equation 7 to derive the objective function for the proposed method. Here we provide the detailed derivation, which was omitted from the main manuscript due to the page limitation. We denote by p(·) the true distribution induced by the policy π_θ(a|s) and by q(·) the distribution that approximates it. The KL divergence between q(z) and p(z) is defined as

D_KL( q(z) || p(z) ) = ∫ q(z) log ( q(z) / p(z) ) dz.   (25)

Based on this notation, the log-likelihood log π_θ(a_i|s_i) can be written as follows:

log π_θ(a_i|s_i) = ∫ q_φ(z|s_i, a_i) log π_θ(a_i|s_i) dz   (26)
= ∫ q_φ(z|s_i, a_i) [ log π_θ(a_i|s_i, z) + log p(z|s_i) - log p(z|s_i, a_i) ] dz   (27)
= ∫ q_φ(z|s_i, a_i) log ( q_φ(z|s_i, a_i) / p(z|s_i, a_i) ) dz - ∫ q_φ(z|s_i, a_i) log ( q_φ(z|s_i, a_i) / p(z|s_i) ) dz + ∫ q_φ(z|s_i, a_i) log π_θ(a_i|s_i, z) dz   (28)
= D_KL( q_φ(z|s_i, a_i) || p(z|s_i, a_i) ) - D_KL( q_φ(z|s_i, a_i) || p(z|s_i) ) + E_{z∼q_φ(z|s_i, a_i)}[ log π_θ(a_i|s_i, z) ].   (29)

In the first line, we marginalize over z; as log π_θ(a_i|s_i) is independent of the latent variable z, the equality holds. Since D_KL( q_φ(z|s_i, a_i) || p(z|s_i, a_i) ) ≥ 0, we obtain a variant of the variational lower bound of the conditional log-likelihood:

log π_θ(a_i|s_i) ≥ -D_KL( q_φ(z|s_i, a_i) || p(z|s_i) ) + E_{z∼q_φ(z|s_i, a_i)}[ log π_θ(a_i|s_i, z) ].

C PROOF OF THE CONTRACTION OF THE LATENT-MAX-Q OPERATOR

We consider the operator T_z, given by

T_z Q(s, a) = E_{s'}[ r(s, a) + γ max_{z'} Q(s', μ(s', z')) ].

To prove that T_z is a contraction, we use the infinity norm

||Q_1 - Q_2||_∞ = max_{s∈S, a∈A} |Q_1(s, a) - Q_2(s, a)|,

where Q_1 and Q_2 are different estimates of the Q-function. We consider the infinity norm of the difference between these two estimates after applying the operator T_z:

||T_z Q_1 - T_z Q_2||_∞   (34)
= max_{s,a} | E_{s'}[ r(s, a) + γ max_{z'} Q_1(s', μ(s', z')) ] - E_{s'}[ r(s, a) + γ max_{z'} Q_2(s', μ(s', z')) ] |   (35)
= γ max_{s,a} | E_{s'}[ max_{z'} Q_1(s', μ(s', z')) ] - E_{s'}[ max_{z'} Q_2(s', μ(s', z')) ] |   (36)
= γ max_{s,a} | E_{s'}[ max_{z'} Q_1(s', μ(s', z')) - max_{z'} Q_2(s', μ(s', z')) ] |   (37)
≤ γ max_{s,a} E_{s'}[ max_{z'} | Q_1(s', μ(s', z')) - Q_2(s', μ(s', z')) | ]   (38)
≤ γ ||Q_1 - Q_2||_∞.   (39)

The above relationship shows that T_z is a γ-contraction.
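The contraction can also be checked numerically on a toy tabular problem. The sketch below is illustrative only: the random MDP and Q-tables are arbitrary test data, and the entry `Q[s, z]` stands for Q(s, μ(s, z)).

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_latents, gamma = 5, 3, 0.9
r = rng.normal(size=num_states)                          # reward, collapsed to r(s)
P = rng.dirichlet(np.ones(num_states), size=num_states)  # transition p(s'|s)

def apply_Tz(Q):
    """T_z Q(s) = E_{s'}[ r(s) + gamma * max_{z'} Q(s', z') ] on a toy
    tabular MDP, where Q[s, z] plays the role of Q(s, mu(s, z))."""
    next_values = P @ Q.max(axis=1)                      # E_{s'}[max_{z'} Q(s', z')]
    return np.tile((r + gamma * next_values)[:, None], (1, num_latents))

Q1 = rng.normal(size=(num_states, num_latents))
Q2 = rng.normal(size=(num_states, num_latents))
lhs = np.abs(apply_Tz(Q1) - apply_Tz(Q2)).max()          # ||T_z Q1 - T_z Q2||_inf
rhs = gamma * np.abs(Q1 - Q2).max()                      # gamma * ||Q1 - Q2||_inf
assert lhs <= rhs + 1e-12                                # the gamma-contraction bound
```

Whatever random Q-tables are drawn, the final inequality holds, mirroring steps (34)-(39) above.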

D EFFECT OF THE DIMENSIONALITY OF THE DISCRETE LATENT VARIABLE

In our evaluation, we first evaluated the effect of the dimensionality of the discrete latent variable. The results are shown in Table 5. As shown, infoV2AE with |Z| = 8 demonstrated the best performance, while the performance with |Z| = 16 and |Z| = 32 is comparable. These results show that the performance of the policy is not very sensitive to the dimensionality of the latent variable. However, the performance with |Z| = 4 is relatively weak, which indicates that the policy may not be expressive enough when the dimensionality of the latent variable is too small. As |Z| = 8 consistently showed satisfactory performance, it was used in the subsequent evaluations.

E RESULTS ON MUJOCO TASKS USING THE D4RL-V0 DATASETS

As reported in (Fujimoto & Gu, 2021), there is a non-negligible difference when using the D4RL-v2 and D4RL-v0 datasets. Having presented our results using the D4RL-v2 datasets in the main manuscript, we here show the results using the D4RL-v0 datasets in Table 6. For CQL, we show the results obtained by re-running the code provided on the website foot_0. We performed the experiments with our implementations of easyBCQ and AWAC. The results were mixed, and it is difficult to identify the best algorithm for the MuJoCo tasks with the D4RL-v0 datasets. However, the performance of V2AE was comparable to that of the state-of-the-art methods on these tasks. In particular, the performance of V2AE consistently matched or exceeded that of the baseline methods on datasets containing expert demonstrations, e.g., *-expert and *-medium-expert.

F PERFORMANCE OF AWAC

We used two techniques in our implementation of AWAC that are not used in the original AWAC paper (Nair et al., 2020): 1) the state normalization proposed in (Fujimoto & Gu, 2021) and 2) the normalization of the advantage function in equation 15. We provide a comparison between the performance of our implementation of AWAC and the performance reported in the original paper (Nair et al., 2020). Our experiments show that our implementation significantly outperformed the results reported in the original AWAC paper.

G COMPARISON WITH ADDITIONAL BASELINES

We provide a comparison with additional baselines on the mujoco-v2 tasks in D4RL in Table 8. We show the results of MAPLE, a recent model-based offline algorithm using latent representations (Chen et al., 2021b). In addition, we show the results of Decision Transformer (Chen et al., 2021a) as a representative of transformer-based methods. We also included Implicit Q-learning (IQL), which employs expectile regression for learning the Q-function (Kostrikov et al., 2022). While these methods are well known and state-of-the-art, we omitted them from the main manuscript due to the page limitation, as we focused there on methods that use double-clipped Q-learning for the critic. For each baseline method, we adopted the results reported in the original paper. V2AE and infoV2AE show performance consistently better than or comparable to these baseline methods, although our implementations of V2AE and infoV2AE do not employ techniques such as an ensemble of critics. This result indicates the significant effect of the policy structure in offline RL.

G.1 COMPARISON WITH AWAC USING A GAUSSIAN MIXTURE POLICY

We also provide a comparison with AWAC with a Gaussian mixture policy in Table 9. Overall, the use of the Gaussian mixture policy does not improve the performance of AWAC, which is consistent with the observation in a recent study by Chen et al. (2022). When a Gaussian mixture policy is employed, a Gaussian component that covers a large part of the state space often appears and governs the resulting performance; this behavior is also often observed in the context of the option-critic framework for online RL (see Smith et al. (2018)). If that happens, we cannot exploit the discrete latent variable. By contrast, we employ a mixture of deterministic policies, not a mixture of Gaussian policies. Unlike a Gaussian policy, a deterministic policy can be seen as a distribution given by the Dirac delta function, which is the limit of a Gaussian as the standard deviation goes to zero. When we use the mixture of deterministic policies, no single component covers a large part of the state space; the components can be distributed separately, and we can exploit the benefit of the discrete latent variable. In addition, we leverage the variational lower bound to incorporate the prior distribution of the latent variable, which enables us to obtain meaningful latent representations. These are the reasons why our methods clearly outperformed AWAC, while simply using a Gaussian mixture policy does not improve the performance of AWAC.

G.2 COMPARISON WITH LAPO

Recently, Chen et al. (2022) proposed Latent-variable Advantage-weighted Policy Optimization (LAPO), an algorithm that leverages a continuous latent space for policy learning. As the approach is related to our method V2AE, we provide a detailed comparison in this section. We found that the authors' implementation of LAPO in https://github.com/pcchenxi/LAPO-offlienRL includes techniques to improve the performance, such as action normalization and clipping of the target value for the state-value function. Such techniques are compatible with V2AE and our baseline methods, but they are not incorporated in our experiments. For this reason, we first evaluated LAPO without these techniques, which we refer to as LAPO-. The results are reported in Table 10. Overall, V2AE and infoV2AE outperformed LAPO on the mujoco-v2 tasks and the Antmaze-v0 tasks. While our method V2AE leverages a discrete latent space, LAPO uses a continuous latent space; in this sense, LAPO is closer to cV2AE, the variant of V2AE that uses a continuous latent space. In the main manuscript, we showed that the critic loss value often explodes during the training of cV2AE. Similarly, the critic loss value increases rapidly at the beginning of the training of LAPO, as shown in Figure 5. While the critic loss value often decreases at the end of the training of LAPO, it is still higher than that of V2AE. As the performance of the policy is better in V2AE on these tasks, the surge of the critic loss value indicates the generation of out-of-distribution actions during the training of LAPO. These results support our observations in the main manuscript and indicate that the use of the discrete latent variable is effective in reducing the generation of out-of-distribution actions.

G.3 COMPARISON WITH IQL AND LAPO ON ANTMAZE

As the Antmaze tasks are considered challenging tasks in offline RL, we provide detailed results on them in this section. As we found that a few techniques used in LAPO are essential to reproduce the results in Chen et al. (2022), we also evaluated variants of V2AE that incorporate such techniques. In methods named like "xx lapo", the techniques used in LAPO (Chen et al., 2022) are incorporated. We summarize the techniques/differences below:
• computation of the target value as Q_target = 0.7 min(Q_1, Q_2) + 0.3 max(Q_1, Q_2)
• target value clipping: v_target = min(max(v_target, v_min), v_max)
• reward scaling: r ← r × 100
• network architecture: three hidden layers with 256 units; learning rate: 2e-4
For target value clipping, v_min and v_max are computed as v_min = min_D r · 1/(1-γ) and v_max = max_D r · 1/(1-γ), respectively. In LAPO-, IQL, V2AE, and infoV2AE, we used the reward r ← r - 1, following the protocol used in the original IQL paper. While action normalization is used in LAPO, we found that it does not improve the performance of IQL and V2AE; therefore, action normalization is used only in LAPO in our results. In the tables below, "LAPO-" shows LAPO without these techniques, although the network architecture is the same as in the original LAPO. "IQL (rerun)" indicates the results of re-running IQL in our code base, which fairly reproduced the results reported in the original IQL paper. "V2AE" shows the result reported in our main manuscript. In Tables 11 and 12, we report the results across five different seeds with 100 test episodes after 1 million updates. Regarding LAPO, we could not reproduce the results reported in the paper, possibly due to differences in the experimental procedure: in the original paper (Chen et al., 2022), the methods are evaluated across three random seeds with 10 test episodes.
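The critic-side techniques listed above can be sketched in a few lines. This is a hedged sketch under our own naming, not code from any released implementation.

```python
import numpy as np

def lapo_style_target(q1_next, q2_next, rewards_dataset, gamma=0.99):
    """Weighted twin-critic combination with dataset-based clipping:
    Q_target = 0.7 * min(Q1, Q2) + 0.3 * max(Q1, Q2), then clipped to
    [min_D r / (1 - gamma), max_D r / (1 - gamma)]."""
    q_target = 0.7 * np.minimum(q1_next, q2_next) + 0.3 * np.maximum(q1_next, q2_next)
    v_min = rewards_dataset.min() / (1.0 - gamma)   # v_min = min_D r * 1/(1-gamma)
    v_max = rewards_dataset.max() / (1.0 - gamma)   # v_max = max_D r * 1/(1-gamma)
    return np.clip(q_target, v_min, v_max)
```

Compared with plain clipped double-Q learning, the 0.7/0.3 weighting keeps a small contribution from the larger critic estimate, while the clipping range rules out target values that no return in the dataset could produce.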
When using the techniques from LAPO, the overall performance of V2AE and infoV2AE was better than that of LAPO. While LAPO nicely shows how to leverage the latent space in offline RL, it does not utilize a discrete latent space; the difference in performance between LAPO and our methods shows the benefit of leveraging the discrete latent variable. Regarding IQL, we could fairly reproduce the performance reported in Kostrikov et al. (2022). Interestingly, the techniques used in LAPO did not improve the performance of IQL. When using these techniques, the overall performance of V2AE and infoV2AE was better than or comparable to that of IQL. This result shows that incorporating the policy structure can match the performance achieved by a state-of-the-art algorithm for the critic.

H EFFECT OF THE SCALING PARAMETER

To investigate the effect of the value of α in equation 15, we evaluated the performance of V2AE and infoV2AE with α = 5.0 and α = 10.0. The results are shown in Table 13. Except for hopper-medium-expert-v2, the results with α = 10.0 are better than those with α = 5.0. In the main manuscript, we report the result with α = 5.0 for infoV2AE on hopper-medium-expert-v2 and α = 10.0 for the rest of the mujoco-v2 tasks.

I PERFORMANCE OF CV2AE

We provide the performance of cV2AE, the variant of V2AE with the continuous latent variable, on the mujoco-v2 tasks in Table 14. cV2AE outperformed AWAC in terms of the sum of the scores across the tasks; however, its overall performance is lower than that of V2AE and infoV2AE. In cV2AE, the policy is more flexible and expressive than a Gaussian policy, which resulted in performance better than AWAC. Meanwhile, the continuous latent space still interpolates the separate modes of the multimodal data distribution in cV2AE, which will result in generating out-of-distribution actions in offline RL. We think this is why V2AE with the discrete latent variable outperforms cV2AE with the continuous latent variable.

J THE CRITIC LOSS AND NORMALIZED SCORE IN MUJOCO TASKS

We show the critic loss and the normalized scores for the MuJoCo tasks in Figures 6, 7, and 8. The critic loss in V2AE is comparable to or lower than that of AWAC in the MuJoCo tasks. However, there are exceptions, such as hopper-expert-v2 and hopper-medium-expert-v2, where the critic loss surges in V2AE but not in AWAC. Thus, it is necessary to be aware that, while the use of the mixture policy often mitigates the accumulation of the critic error, it may compound the accumulation in some rare cases. Regarding the use of the continuous latent variable, while cV2AE often achieves lower critic loss values than AWAC, the critic loss of cV2AE occasionally explodes. A possible explanation for these results is that learning the continuous latent variable provides flexibility to the policy representation, but separate modes of the data distribution are still interpolated in the learned latent space, which results in generating out-of-distribution samples.

K ADDITIONAL RESULTS FOR THE ACTIVATION OF THE LATENT-CONDITIONED POLICY

Figure 9 shows the activation of sub-policies in V2AE on the pen-human-v0 task in three episodes. Here, we used the same policy trained with 10,000 updates. As shown, the target orientation of the object is different in each episode, and different sub-policies are activated to achieve the given goals. The sub-policy corresponding to z = 7 (yellow) is often activated during steps 20-40; we think that this sub-policy is effective for changing the orientation of the pen in this task. On the contrary, the sub-policies corresponding to z = 0, 1, 2 are not activated in many episodes; we think that these sub-policies correspond to the state-action pairs with low Q-values. In offline RL, a given dataset often contains behaviors of mixed quality, and it is not effective to reproduce all of them; rather, the behaviors corresponding to low scores should be ignored. We think that the proposed algorithm can implicitly achieve this by dividing the state-action space and assessing the quality of the sub-policy corresponding to each region. This qualitative result supports the claim that different behaviors are encoded in each of the sub-policies. A visualization of the state-action pairs and the corresponding values of the latent variable on the kitchen-complete-v0 task is provided in Figure 10. We also show the histogram of the activated sub-policies in Figure 11 and the activation of sub-policies in a successful episode on the kitchen-complete-v0 task in Figure 12.

L HYPERPARAMETERS AND IMPLEMENTATION DETAILS

Computational resources and license The experiments were run on Amazon Web Services and on workstations with an NVIDIA RTX 3090 GPU and an Intel Core i9-10980XE CPU at 3.0 GHz. We used the physics simulator MuJoCo (Todorov et al., 2012) under the institutional license, and later switched to the Apache license.

Software The software versions used in the experiments are listed below:
• Python 3.8
• PyTorch 1.10.0
• Gym 0.21.0
• MuJoCo 2.1.0
• mujoco-py 2.1.2.14
We used the author implementations of TD3+BC foot_1, CQL, and EDAC foot_2. For the mujoco-v2 and Antmaze tasks, we used the updated CQL implementation foot_3; for the mujoco-v0, Kitchen, and Adroit tasks, we used the original CQL implementation foot_4. For a fair comparison with V2AE, we implemented easyBCQ and AWAC ourselves. For easyBCQ, we used the hyperparameters from the author-provided implementation foot_5. We implemented V2AE and AWAC based on the author-provided implementation of TD3. In our implementations of easyBCQ and AWAC, double-clipped Q-learning was employed. To minimize the difference between V2AE and AWAC, we also used a delayed update of the policy in both V2AE and AWAC. For simplicity, we did not use a regularization technique for the actor such as the dropout layer used in (Kostrikov et al., 2022), although the use of such techniques should further improve performance. In our implementation of V2AE, the value of z is part of the input to the actor network; thus, the different behaviors corresponding to different values of z are represented by the same actor network.

Computation of the advantage function In V2AE, the policy is deterministic, as both the gating policy π(z|s) and the sub-policies π(a|s, z) are deterministic. Thus, the state-value function is given by V^π(s) = max_z Q^π(s, μ(s, z)), and the advantage function is given by

A^π(s, a) = Q^π(s, a) - V^π(s) = Q^π(s, a) - max_z Q^π(s, μ(s, z)).   (42)

In the update of the policy, we used the target actor in the second term of equation 42. Thus, in our implementation, the advantage function is approximated with

A(s, a; w, θ') = Q(s, a; w) - max_z Q(s, μ_θ'(s, z); w),

where θ' denotes the parameters of the target actor.

Implementation of a variant of V2AE with the continuous latent variable We refer to the variant of V2AE with the continuous latent variable as cV2AE in this paper. The differences from our main algorithm V2AE, which employs the discrete latent variable, are the reparametrization used for sampling the latent variable, the KL divergence term in the loss function, and the computation of z* = argmax_z Q_w(s, μ(s, z)). We assume that the prior distribution p(z) and the posterior distribution p(z|s, a) are Gaussian, and we use the reparameterization of the standard VAE,

z = μ_z + ε ⊙ σ_z,   (46)

where μ_z and σ_z are the mean and the standard deviation of the posterior distribution p(z|s, a). As the latent variable z is continuous, we cannot compute the softmax distribution in equation 8, given by

p(z|s) = exp( Q_w(s, μ_θ(s, z)) ) / Σ_{z'∈Z} exp( Q_w(s, μ_θ(s, z')) ).   (47)

Thus, we assume that p(z|s) is Gaussian, as in previous studies (Fujimoto et al., 2019), and the KL divergence term in the objective function in equation 7 is computed as in the standard VAE (Kingma & Welling, 2014). When the latent variable z is continuous, we also cannot explicitly compute z* = argmax_z Q_w(s, μ(s, z)). Thus, we approximate it as follows: we generate N samples z_i ∼ N(0, I) for i = 1, ..., N and determine ẑ* = argmax_{z_i} Q_w(s, μ(s, z_i)). In our implementation, the continuous latent variable z is two-dimensional and N = 100. When the dimensionality of z is higher, this approximation of z* becomes coarser; we found that the performance drops significantly when the dimensionality of z is more than two.
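The sampling-based approximation of z* = argmax_z Q_w(s, μ(s, z)) used for the continuous latent variant can be sketched as follows. The names are illustrative, and the toy critic and sub-policy below are hypothetical stand-ins.

```python
import numpy as np

def approx_best_latent(s, q_fn, sub_policy, num_samples=100, latent_dim=2, rng=None):
    """Monte-Carlo argmax over a continuous latent: draw z_i ~ N(0, I),
    score each candidate action mu(s, z_i) with the critic, and return
    the best-scoring latent sample."""
    rng = rng if rng is not None else np.random.default_rng()
    zs = rng.standard_normal((num_samples, latent_dim))  # z_i ~ N(0, I)
    qs = np.array([q_fn(s, sub_policy(s, z)) for z in zs])
    return zs[int(np.argmax(qs))]                        # z-hat* = argmax_i Q

# Toy check: with a critic that prefers actions near the origin and
# sub-policy mu(s, z) = z, the selected latent should have a small norm.
best_z = approx_best_latent(
    s=None,
    q_fn=lambda s, a: -float(np.sum(a ** 2)),
    sub_policy=lambda s, z: z,
    num_samples=200,
    rng=np.random.default_rng(0),
)
```

As noted above, this approximation degrades as the latent dimensionality grows, since a fixed budget of N samples covers a high-dimensional latent space only coarsely.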

Number of updates

In kitchen-complete-v0, pen-human-v0, hammer-human-v0, door-human-v0, and relocate-human-v0, the number of samples contained in the dataset is significantly smaller than in the other datasets. While the datasets for the MuJoCo tasks contain approximately 1 million samples, the numbers of samples in the Adroit-human tasks and kitchen-complete-v0 are as follows: kitchen-complete-v0: 3,679 samples; pen-human-v0: 4,950 samples; hammer-human-v0: 11,264 samples; door-human-v0: 6,703 samples; and relocate-human-v0: 9,906 samples. Thus, on these five tasks we updated the policy 10,000 times, while for the other tasks we updated the policy 1 million times. This number of policy updates was applied to all methods except easyBCQ, which is a one-step offline RL algorithm. For easyBCQ, the behavior policy was trained 500,000 times and the critics were trained 2 million times, except for the *-human-v0 tasks in Adroit, where the behavior policy was trained 50,000 times and the critics were trained 200,000 times.

Hyperparameters Tables 15-19 provide the hyperparameters used in the experiments.



foot_0: https://github.com/aviralkumar2907/CQL
foot_1: https://github.com/sfujim/TD3_BC
foot_2: https://github.com/snu-mllab/EDAC
foot_3: https://github.com/young-geng/CQL
foot_4: https://github.com/aviralkumar2907/CQL
foot_5: https://github.com/davidbrandfonbrener/onestep-rl



Figure 2: Performance on a simple task with multimodal data distribution.

Figure 3: Critic loss and normalized score during the training.

Figure 5: Critic loss and normalized score during training on mujoco-v2 locomotion tasks.

Figure 6: Critic loss and normalized score during training on HalfCheetah tasks.

Figure 8: Critic loss and normalized score during training on walker2d tasks.


Figure 13: Connection between q ϕ (z|s, a), µ θ (s, z), and g ψ (z|s, a) during the training.

Algorithm setup in the experiment.

Results on mujoco tasks using D4RL-v2 datasets and AntMaze tasks. Average normalized scores over the last 10 test episodes and five seeds are shown. HC = HalfCheetah, HP = Hopper, WK = Walker2d.

Results on Kitchen and Adroit tasks using the average normalized scores over the last 10 test episodes and five seeds.

Effect of the dimensionality of the discrete latent variable on Adroit tasks.

Results on MuJoCo tasks using D4RL-v0 datasets. The average normalized scores after 1 million time steps over the last 10 test episodes and five seeds are shown. HC = HalfCheetah, HP = Hopper, WK = Walker2d, Med.-E = Medium-Expert, Med.-R = Medium-Replay, Med. = Medium, and Rand. = Random. The bold text indicates the best performance; the underlined text indicates the tasks for which V2AE outperformed AWAC. |Z| = 8 for V2AE.

Comparison of AWAC implementations on MuJoCo tasks using D4RL-v0 datasets. The average normalized scores after 1 million time steps over the last 10 test episodes and five seeds are shown. The bold text indicates the tasks for which the performance of our implementation of AWAC exceeded that of AWAC reported in the original paper (Nair et al., 2020). HC = HalfCheetah, HP = Hopper, WK = Walker2d.

Results on MuJoCo tasks using D4RL-v2 datasets. Average normalized scores over the last 10 test episodes and five seeds are shown. HC = HalfCheetah, HP = Hopper, WK = Walker2d. The gray text indicates the performance lower than that of V2AE/infoV2AE. The bold text indicates the best performance.

Comparison with AWAC with a Gaussian mixture policy (mixAWAC) using D4RL-v2 datasets. Average normalized scores over the last 10 test episodes and five seeds are shown. The bold text indicates the best performance.

Comparison with IQL on Antmaze tasks. Average normalized scores over the last 100 test episodes and five seeds are shown. The bold text indicates the best performance.

Comparison with LAPO on Antmaze tasks. Average normalized scores over the last 100 test episodes and five seeds are shown. The bold text indicates the best performance.

Results on mujoco tasks using D4RL-v2 datasets with different values of α in equation 15. Average normalized scores over the last 10 test episodes and five seeds are shown. HC = HalfCheetah, HP = Hopper, WK = Walker2d.

Hyperparameters of TD3+BC. The default hyperparameters in the TD3+BC GitHub are used.

Hyperparameters of CQL. The default hyperparameters in the CQL GitHub are used.

Hyperparameters of easyBCQ. The default hyperparameters of easyBCQ in the author-provided codes in GitHub are used.


Target smoothing in V2AE In V2AE, a policy is given by a mixture of deterministic sub-policies, where a sub-policy is selected in a deterministic manner, as in equation 3. Thus, the mixture policy in our framework is deterministic. As reported by Fujimoto & Gu (2021), the use of a deterministic policy may lead to an overfitting of the critic to narrow peaks. Since our policy is deterministic, we also employed the target policy smoothing technique used in TD3. Thus, the target action used in the target value in equation 14 is modified as

μ_θ'(s', z') + ε_clip,

where ε_clip is given by

ε_clip = clip(ε, -c, c),  ε ∼ N(0, σ²),

and the constant c defines the range of the noise.
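The smoothing step above can be sketched as follows. This is a minimal sketch; the values of σ, c, and the action bounds are illustrative, not the ones used in our experiments.

```python
import numpy as np

def smoothed_target_action(mu_target_action, sigma=0.2, c=0.5, rng=None):
    """TD3-style target policy smoothing: add clipped Gaussian noise to the
    target sub-policy action before it is fed to the target critic,
    a' = clip(mu_theta'(s', z') + clip(eps, -c, c), -1, 1)."""
    rng = rng if rng is not None else np.random.default_rng()
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(mu_target_action)), -c, c)
    return np.clip(mu_target_action + eps, -1.0, 1.0)  # keep action in valid range

a_smoothed = smoothed_target_action(np.zeros(3), rng=np.random.default_rng(1))
```

The clipping of the noise to [-c, c] keeps the perturbed action close to the sub-policy output, so the target critic is evaluated only in a small neighborhood of μ_θ'(s', z').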

Mutual-information-based regularization

To implement the MI-based regularization, we introduced another network to represent g_ψ(z|s, a) in addition to the network that represents the posterior distribution q_φ(z|s, a). When maximizing the objective L_ML in equation 9, both the actor μ_θ(s, z) and the auxiliary distribution g_ψ(z|s, a) are updated, but the posterior distribution q_φ(z|s, a) is frozen. When maximizing Σ_{i=1}^N E_{z∼p(z)}[ log g_ψ(z|s_i, μ_θ(s_i, z)) ], both the actor μ_θ(s, z) and the posterior distribution q_φ(z|s, a) are updated, but the auxiliary distribution g_ψ(z|s, a) is frozen. For maximizing log g_ψ(z|s_i, μ_θ(s_i, z)), the latent variable is sampled from the prior distribution, i.e., the uniform distribution in this case, and the maximization of log g_ψ(z|s_i, μ_θ(s_i, z)) is approximated as minimizing the squared difference ||z - ẑ||², where ẑ is the output of the network that represents g_ψ(z|s, a).
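The surrogate loss described above can be sketched as follows. This is a hedged sketch: the linear maps below are hypothetical stand-ins for the actor μ_θ(s, z) and the auxiliary network g_ψ(z|s, a), which are neural networks in our actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, latent_dim = 4, 3, 2
W_actor = rng.normal(size=(state_dim + latent_dim, action_dim))  # stand-in actor
W_g = rng.normal(size=(state_dim + action_dim, latent_dim))      # stand-in g_psi

def mi_surrogate_loss(states):
    """Sample z from the (uniform) prior, run the actor, and measure
    ||z - z_hat||^2 with z_hat = g_psi(s, mu_theta(s, z)). During training
    this quantity is minimized w.r.t. the actor while g_psi is frozen."""
    z = rng.uniform(size=(states.shape[0], latent_dim))      # z ~ prior p(z)
    a = np.concatenate([states, z], axis=1) @ W_actor        # mu_theta(s, z)
    z_hat = np.concatenate([states, a], axis=1) @ W_g        # prediction of g_psi
    return float(np.mean(np.sum((z - z_hat) ** 2, axis=1)))

loss = mi_surrogate_loss(rng.normal(size=(8, state_dim)))
```

Driving this loss down encourages the actor to produce actions from which the latent code can be recovered, i.e., actions that remain distinguishable across values of z.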

