IN-SAMPLE ACTOR CRITIC FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning suffers from the out-of-distribution (OOD) issue and extrapolation error. Most methods penalize OOD state-action pairs or regularize the trained policy towards the behavior policy, but cannot guarantee the elimination of extrapolation error. We propose In-sample Actor Critic (IAC), which utilizes sampling-importance resampling to perform in-sample policy evaluation. IAC uses only the target Q-values of the actions in the dataset to evaluate the trained policy, thus avoiding extrapolation error. The proposed method performs unbiased policy evaluation and has lower variance than importance sampling in many cases. Empirical results show that IAC obtains competitive performance compared to state-of-the-art methods on Gym-MuJoCo locomotion domains and the much more challenging AntMaze domains. * Equal contribution.

1. INTRODUCTION

Reinforcement learning (RL) aims to solve sequential decision problems and has received extensive attention in recent years (Mnih et al., 2015). However, practical applications of RL face several challenges, such as risky attempts during exploration and a time-consuming data-collection phase. Offline RL is capable of tackling these issues without interaction with the environment: it avoids unsafe exploration and can tap into existing large-scale datasets (Gulcehre et al., 2020). However, offline RL suffers from the out-of-distribution (OOD) issue and extrapolation error (Fujimoto et al., 2019). Numerous works have been proposed to overcome these issues. One branch of popular methods penalizes OOD state-action pairs or regularizes the trained policy towards the behavior policy (Fujimoto & Gu, 2021; Kumar et al., 2020). These methods have to control the degree of regularization to balance pessimism and generalization, and thus are sensitive to the regularization level (Fujimoto & Gu, 2021). In addition, OOD constraints cannot guarantee the avoidance of extrapolation error (Kostrikov et al., 2022). Another branch chooses to eliminate extrapolation error completely (Brandfonbrener et al., 2021; Kostrikov et al., 2022). These methods conduct in-sample learning by querying only the Q-values of actions in the dataset when formulating the Bellman target. However, OneStep RL (Brandfonbrener et al., 2021) estimates the behavior policy's Q-value via SARSA (Sutton & Barto, 2018) and improves the policy by only a single step based on that Q-value function, which has limited potential to discover the optimal policy hidden in the dataset. IQL (Kostrikov et al., 2022) relies on expectile regression to perform implicit value iteration. It can be regarded as in-support Q-learning when the expectile approaches 1, but suffers from instability in that case; a suboptimal solution is therefore obtained by using a smaller expectile.
Besides, these two lines of study adapt the trained policy to the fixed dataset's distribution. A natural question then arises: "Can we introduce the concept of in-sample learning to iterative policy iteration, a commonly used paradigm for solving RL?" General policy iteration cannot be updated in an in-sample style, since the trained policy will inevitably produce actions that are out of the dataset (out-of-sample) and provide overestimated Q-targets for policy evaluation. To enable in-sample learning, we first consider sampling the target action from the dataset and reweighting the temporal difference gradient via importance sampling. However, importance sampling is known to suffer from high variance (Precup et al., 2001), which would impair the training process. In this paper, we propose In-sample Actor Critic (IAC), which performs iterative policy iteration while following the principle of in-sample learning to eliminate extrapolation error. We resort to sampling-importance resampling (Rubin, 1988) to reduce variance and execute in-sample policy evaluation, which formulates the gradient as if it were sampled from the trained policy. To this end, we use a SumTree to sample according to the importance resampling weight. For policy improvement, we tap into advantage-weighted regression (Peng et al., 2019) to control the deviation from the behavior policy. The proposed method executes unbiased policy evaluation and has a smaller variance than importance sampling in many cases. We point out that, unlike previous methods, IAC dynamically adapts the dataset's distribution to match the trained policy during learning. We test IAC on the D4RL benchmark (Fu et al., 2020), including Gym-MuJoCo locomotion domains and the much more challenging AntMaze domains. The empirical results show the effectiveness of IAC.

2. RELATED WORKS

Offline RL. Offline RL, previously termed batch RL (Ernst et al., 2005; Riedmiller, 2005), provides a static dataset from which to learn a policy. It has received attention recently due to the extensive usage of deep function approximators and the availability of large-scale datasets (Fujimoto et al., 2019; Ghasemipour et al., 2021). However, it suffers from extrapolation error due to OOD actions. Some works attempt to penalize the Q-values of OOD actions (Kumar et al., 2020; An et al., 2021). Other methods force the trained policy to be close to the behavior policy via KL divergence (Wu et al., 2019), behavior cloning (Fujimoto & Gu, 2021), or Maximum Mean Discrepancy (MMD) (Kumar et al., 2019). These methods cannot eliminate extrapolation error and require a regularization hyperparameter to control the constraint level and balance pessimism against generalization. Another branch chooses to refer only to the Q-values of in-sample actions when formulating the Bellman target, without querying the values of actions not contained in the dataset (Brandfonbrener et al., 2021; Kostrikov et al., 2022). By doing so, they avoid extrapolation error. OneStep RL (Brandfonbrener et al., 2021) evaluates the behavior policy's Q-value function and conducts only one step of policy improvement without off-policy evaluation. However, it performs worse than its multi-step counterparts when a large dataset with good coverage is provided. IQL (Kostrikov et al., 2022) draws on expectile regression to approximate an upper expectile of the value distribution and executes multi-step dynamic programming updates. When the expectile approaches 1, it resembles in-support Q-learning in theory but suffers from instability in practice; a suboptimal solution is therefore obtained by using a smaller expectile. Our proposed method opens up an avenue for in-sample iterative policy iteration: it avoids querying unseen actions and is unbiased for policy evaluation.
In practice, it modifies the sampling distribution to allow better computational efficiency. OptiDICE (Lee et al., 2021) also does not refer to out-of-sample actions. However, it involves complex min-max optimization and requires a normalization constraint to stabilize the learning process.

Importance sampling. Importance sampling has a long history of application in RL (Precup, 2000) owing to its unbiasedness and consistency (Kahn & Marshall, 1953). However, it suffers from high variance, especially for long-horizon tasks and high-dimensional spaces (Levine et al., 2020). Weighted importance sampling (Mahmood et al., 2014; Munos et al., 2016) and truncated importance sampling (Espeholt et al., 2018) have been developed to reduce variance. Recently, marginalized importance sampling has been proposed to mitigate the high variance of multiplied importance ratios in off-policy evaluation (Nachum et al., 2019; Liu et al., 2018). Sampling-importance resampling is an alternative strategy that draws data from the dataset according to the importance ratio (Rubin, 1988; Smith & Gelfand, 1992; Gordon et al., 1993). It has been applied in Sequential Monte Carlo sampling (Skare et al., 2003) and off-policy evaluation (Schlegel et al., 2019). To the best of our knowledge, our work is the first to draw on sampling-importance resampling to address the extrapolation error problem in offline RL.

3. PRELIMINARIES

RL. In RL, the environment is typically assumed to be a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, R, p, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, scalar reward function $R$, transition dynamics $p$, and discount factor $\gamma$ (Sutton & Barto, 2018). The agent interacts with the MDP according to a policy $\pi(a|s)$, which is a mapping from states to actions (deterministic policy) or a probability distribution over actions (stochastic policy). The goal of the agent is to obtain a policy that maximizes the expected discounted return $\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.
Off-policy RL methods based on approximate dynamic programming typically utilize a state-action value function (Q-value function), which measures the expected discounted return obtained by starting from the state-action pair $(s, a)$ and then following the policy $\pi$: $Q(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$.

Offline RL. In offline RL, the agent is prohibited from interacting with the environment. Instead, it is provided with a fixed dataset collected by some unknown behavior policy $\beta$. Ordinary approximate dynamic programming methods evaluate a policy $\pi$ by minimizing the temporal difference (TD) error, according to the following loss:
$$L_{TD}(\theta) = \mathbb{E}_{(s,a,s')\sim \mathcal{D}}\big[(r(s, a) + \gamma\, \mathbb{E}_{a'\sim \pi_\phi(\cdot|s')} \bar{Q}_{\theta}(s', a') - Q_\theta(s, a))^2\big], \quad (1)$$
where $\mathcal{D}$ is the dataset, $\pi_\phi$ is a policy parameterized by $\phi$, $Q_\theta(s,a)$ is a Q-function parameterized by $\theta$, and $\bar{Q}_\theta(s,a)$ is a target network whose parameters are updated via Polyak averaging. We denote the $i$-th transition $(s_i, a_i, r_i, s'_i, a'_i)$ in $\mathcal{D}$ as $x_i$, so that $\mathcal{D} = \{x_1, \ldots, x_n\}$. For a transition $x = (s, a, r, s', a')$, let the transition-wise TD update $\Delta(x)$ be the gradient of the transition-wise TD error:
$$\Delta(x) = \nabla_\theta Q_\theta(s, a)\,(Q_\theta(s, a) - r(s, a) - \gamma \bar{Q}_\theta(s', a')).$$
For the convenience of subsequent theoretical analysis, we also define the expected value of the TD update based on the gradient of Eqn. (1) by replacing the empirical distribution of the dataset with the $\beta$-induced distribution:
$$\Delta_{TD} = \mathbb{E}_{x\sim p_\pi}[\Delta(x)], \quad (2)$$
where $p_\pi = d_\beta(s, a) P(s'|s, a)\,\pi(a'|s')$ and $d_\beta(s, a)$ is the normalized and discounted state-action occupancy measure of the behavior policy $\beta$, i.e., $d_\beta(s, a) = (1 - \gamma)\,\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbb{I}(s_t = s, a_t = a) \mid a_t \sim \beta(\cdot \mid s_t)\right]$. Besides policy evaluation, a typical policy iteration also includes policy improvement. In continuous action spaces, a stochastic policy can be updated by reparameterization:
$$\phi \leftarrow \operatorname*{argmax}_\phi\; \mathbb{E}_{s\sim \mathcal{D},\, \epsilon\sim \mathcal{N}}\left[Q_\theta(s, f_\phi(\epsilon; s))\right],$$
where $\mathcal{N}$ is a Gaussian distribution.
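To make the transition-wise update concrete, here is a minimal sketch (our own toy illustration, not the paper's implementation) of $\Delta(x)$ for a linear Q-function with a hypothetical one-hot feature map:

```python
import numpy as np

def td_update(theta, phi, x, gamma=0.99):
    """Transition-wise TD update Delta(x) for a linear Q(s, a) = theta @ phi(s, a).

    x = (s, a, r, s_next, a_next); note the target uses the *dataset* action
    a_next, not an action drawn from the trained policy.
    """
    s, a, r, s_next, a_next = x
    td_error = theta @ phi(s, a) - r - gamma * (theta @ phi(s_next, a_next))
    return phi(s, a) * td_error  # gradient of Q at (s, a) times the TD error

# toy check: one-hot features over 2 states x 2 actions, theta initialized to 0
phi = lambda s, a: np.eye(4)[2 * s + a]
theta = np.zeros(4)
delta = td_update(theta, phi, (0, 1, 1.0, 1, 0), gamma=0.9)
```

With $\theta = 0$ the TD error is $-r = -1$, so the update touches only the feature of the visited pair $(s, a)$.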
In offline RL, OOD actions $a'$ can produce erroneous values of $\bar{Q}_\theta(s', a')$ during Q-value evaluation and lead to an inaccurate Q-value estimate. Then, in the policy improvement stage, where the policy is optimized to maximize the estimated $Q_\theta$, the policy will prefer OOD actions whose values have been overestimated, resulting in poor performance. Most current methods either directly constrain the policy $\pi$ or regularize the values of OOD actions to constrain the policy indirectly.

4. IN-SAMPLE ACTOR CRITIC

In the following, we introduce sampling-importance resampling and tap into it for offline RL to allow in-sample learning. Then we show that the proposed method is unbiased and has a smaller variance than importance sampling in many cases. Last, we present a practical implementation of the algorithm.

4.1. SAMPLING-IMPORTANCE RESAMPLING

Consider a statistical problem relevant to offline RL. Suppose we want to obtain samples from a target distribution $p(x)$, but we only have access to samples from some proposal distribution $q(x)$. How can we simulate random draws from $p$? Assuming $\mathrm{supp}(q) \supseteq \mathrm{supp}(p)$, a classic algorithm, sampling-importance resampling (SIR) (Rubin, 1988), addresses this with the following procedure:

Step 1. (Sampling) Draw independent random samples $\{x_1, \ldots, x_n\}$ from $q$.

Step 2. (Importance) Calculate the importance ratio for each $x_i$: $w(x_i) = p(x_i)/q(x_i)$.

Step 3. (Resampling) Draw $x^*$ from the discrete distribution over $\{x_1, \ldots, x_n\}$ with sampling probabilities $\rho(x_i) = w(x_i)/\sum_{j=1}^{n} w(x_j)$.

Figure 1: SIR for distribution correcting. We generate 100,000 random values from the proposal distribution $q$ and use SIR to resample 10,000 items from them to approximate the target distribution $p$. Sample histograms and the actual underlying densities of both $q$ (start) and $p$ (end) are presented. Left: Uniform $U(-2, 2)$ to Gaussian $N(0, 0.5)$. Right: Gaussian $N(-0.5, 0.5)$ to another Gaussian $N(0.5, 0.5)$.

SIR is very similar to importance sampling (IS), except that IS samples according to $q$ and multiplies the result by the importance ratio, whereas SIR corrects the distribution $q$ by the importance ratio to approximate $p$. The following proposition shows the consistency of SIR: as $n \to \infty$, the resampling distribution converges to the target distribution $p$.

Proposition 1. If $\mathrm{supp}(q) \supseteq \mathrm{supp}(p)$, then as $n \to \infty$, the samples from SIR consist of independent draws from $p$; namely, $x^*$ is distributed according to $p$.

All proofs can be found in Appendix B. Fig. 1 illustrates SIR for distribution correcting on simple one-dimensional cases. Note that Fig. 1 (left) simulates the case where a dataset is generated by a uniform distribution and the target distribution is Gaussian, while Fig. 1 (right) corresponds to the case where a dataset is generated by a Gaussian distribution and the target distribution is a Gaussian with a different mean.
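The left panel of Fig. 1 can be reproduced in a few lines (a sketch; the step numbers refer to the SIR procedure above, and the sample sizes follow the figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Proposal q = Uniform(-2, 2); target p = N(0, 0.5), as in Fig. 1 (left).
n, k = 100_000, 10_000
xs = rng.uniform(-2.0, 2.0, size=n)           # Step 1: sample from q

def p(x):  # target density; an unnormalized density suffices, SIR needs only ratios
    return np.exp(-x**2 / (2 * 0.5**2))

q = 0.25                                       # uniform density on [-2, 2]
w = p(xs) / q                                  # Step 2: importance ratios
rho = w / w.sum()                              # resampling probabilities
resampled = rng.choice(xs, size=k, p=rho)      # Step 3: resample
```

The resampled values are approximately $N(0, 0.5)$ (the truncation at $\pm 2$, i.e., $4\sigma$, is negligible), matching the histogram transition shown in the figure.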

4.2. IN-SAMPLE ACTOR CRITIC

In this work, we adopt SARSA-style in-sample learning because the in-distribution constraints widely used in prior work might not be sufficient to avoid extrapolation error (Kostrikov et al., 2022). Our method is based on policy iteration, which consists of policy evaluation (PE) and policy improvement (PI). In PE, using in-sample actions $a' \in \mathcal{D}$ rather than $a' \sim \pi(\cdot|s')$ in the TD target introduces a bias. We therefore introduce an importance ratio $w(s', a') = \pi(a'|s')/\beta(a'|s')$. Under the assumption of importance sampling, i.e., $\Delta(x_i)\pi(a'_i|s'_i) = 0$ whenever $\beta(a'_i|s'_i) = 0$, Eqn. (2) can be rewritten as
$$\Delta_{TD} = \mathbb{E}_{(s,a,s')\sim \mathcal{D}}\, \mathbb{E}_{a'\sim \beta(\cdot|s')}\big[w(s', a')\,\nabla_\theta Q_\theta(s, a)(Q_\theta(s, a) - r(s, a) - \gamma \bar{Q}_\theta(s', a'))\big]. \quad (4)$$
Here, the assumption of IS (as well as SIR) coincides with the notion of a support-constrained policy set (Kumar et al., 2019):

Definition 2 (Support-constrained policy). Assuming the data distribution is generated by a behavior policy $\beta$, the support-constrained policy class $\Pi_\beta$ is defined as $\Pi_\beta = \{\pi \mid \pi(a|s) = 0 \text{ whenever } \beta(a|s) = 0\}$.

This means that the learned policy $\pi(a|s)$ may have positive density only where the density of the behavior policy $\beta(a|s)$ is positive, instead of constraining the values of the densities $\pi(a|s)$ and $\beta(a|s)$, which is overly restrictive in many cases. Previous works have demonstrated the superiority of restricting the support of the learned policy (Kumar et al., 2019; Ghasemipour et al., 2021).

In practice, it is unrealistic to use the whole dataset to empirically estimate $\Delta_{TD}$ (the expected update) at every iteration, in spite of its low variance. Consequently, we estimate $\Delta_{TD}$ by sampling a mini-batch of size $k$ from the dataset. Specifically, we sample $\{\tilde{x}_1, \ldots, \tilde{x}_k\}$, where each $\tilde{x}_j$ is drawn uniformly from $\mathcal{D} = \{x_1, \ldots, x_n\}$. This leads to an IS estimator of $\Delta_{TD}$:
$$\hat{\Delta}_{IS} = \frac{1}{k}\sum_{j=1}^{k} w(s'_j, a'_j)\,\Delta(\tilde{x}_j), \quad \tilde{x}_j \sim \{x_1, \ldots, x_n\} \text{ uniformly.} \quad (6)$$
Though IS is consistent and unbiased (Kahn & Marshall, 1953), it suffers from high or even infinite variance due to large-magnitude IS ratios (Precup et al., 2001). The high variance of $\hat{\Delta}_{IS}$ could destabilize the TD update and lead to a poor solution. In this work, we adopt SIR instead of IS to reduce the variance and stabilize training. Specifically, we remove the IS ratio and sample $\{\tilde{x}_1, \ldots, \tilde{x}_k\}$, where each $\tilde{x}_j$ is drawn from $\{x_1, \ldots, x_n\}$ with probability proportional to $w(s'_j, a'_j)$, rather than uniformly as in all prior offline RL works. This leads to a SIR estimator of $\Delta_{TD}$:
$$\hat{\Delta}_{SIR} = \frac{1}{k}\sum_{j=1}^{k} \Delta(\tilde{x}_j), \quad \tilde{x}_j \overset{\rho}{\sim} \{x_1, \ldots, x_n\} \text{ with probability } \rho_j = \frac{w(s'_j, a'_j)}{\sum_{i=1}^{n} w(s'_i, a'_i)}. \quad (7)$$
Intuitively, in the offline RL setting, this resampling strategy reshapes the data distribution of $\mathcal{D}$ to adapt to the current policy. Unfortunately, unlike the IS estimator $\hat{\Delta}_{IS}$, $\hat{\Delta}_{SIR}$ is a biased estimator of $\Delta_{TD}$. Subsequently, we show that by simply multiplying $\hat{\Delta}_{SIR}$ by the average importance ratio in the dataset, $\bar{w} := \frac{1}{n}\sum_{i=1}^{n} w_i$, we obtain an unbiased estimate of $\Delta_{TD}$.

Theorem 3. Assume that an offline dataset $\mathcal{D}$ of $n$ transitions is sampled i.i.d. according to $p_\beta(x = (s, a, r, s', a')) = d_\beta(s, a)P(s'|s, a)\beta(a'|s')$, and $\pi$ is support-constrained (i.e., $\pi \in \Pi_\beta$). Then
$$\mathbb{E}[\bar{w}\,\hat{\Delta}_{SIR}] = \Delta_{TD},$$
where $\Delta_{TD}$ is the expected update across all transitions in $\mathcal{D}$ defined in Eqn. (2), $\hat{\Delta}_{SIR}$ is the empirical update across the sampled mini-batch defined in Eqn. (7), and $\bar{w} := \frac{1}{n}\sum_{i=1}^{n} w_i$ is the average importance ratio in the dataset.

In fact, $\hat{\Delta}_{SIR}$ already gives the correct update direction, and we do not need to care about the actual scale of the update: the scalar $\bar{w}$ remains the same across all mini-batches during SGD, so we can absorb $\bar{w}$ into the learning rate and simply use $\hat{\Delta}_{SIR}$ as the update estimator in PE.
We point out that there is no need to adjust the conventional learning rate, because for a large enough dataset, $\bar{w}$ is close to 1. Theorem 3 guarantees that if the policy $\pi$ is constrained within the support of the behavior policy $\beta$ during learning, our method yields an unbiased policy evaluation process via in-sample learning, thus avoiding extrapolation error. Conversely, if the support of the current policy deviates far from the dataset, which is common in practice when the dataset distribution is narrow and the trained policy is randomly initialized, Theorem 3 cannot provide performance guarantees. So in PI, we implicitly enforce a constraint with advantage-weighted regression (Peters & Schaal, 2007; Peng et al., 2019), controlling the deviation from the behavior policy. Since in PE we have already sampled the transitions $\{\tilde{x}_1, \ldots, \tilde{x}_k\}$ non-uniformly from $\mathcal{D}$, for convenience we use the same transitions to perform PI, instead of sampling from $\mathcal{D}$ again uniformly, leading to the following loss:
$$L_\pi(\phi) = -\mathbb{E}_{(s,a)\overset{\rho}{\sim} \mathcal{D}}\Big[\exp\big(\beta\,(Q_\theta(s, a) - \mathbb{E}_{\hat{a}\sim\pi_\phi(\cdot|s)} Q_\theta(s, \hat{a}))\big)\log \pi_\phi(a|s)\Big], \quad (8)$$
where $\overset{\rho}{\sim}$ denotes sampling from the discrete distribution $\rho$ (here $\beta$ in the exponent denotes the inverse temperature of advantage-weighted regression, not the behavior policy). Note that even though our method is in-sample, a constraint (implicit or explicit) is necessary due to both our theoretical requirement and the empirical results of previous works. Among previous in-sample approaches, IQL (Kostrikov et al., 2022) adopts the same advantage-weighted regression in the policy extraction step, while the performance of OneStep RL (Brandfonbrener et al., 2021) drops sharply without constraints. One possible reason is that in-sample methods do not update out-of-sample $(s, a)$ pairs; their Q-values are completely determined by the initialization and generalization of the neural network, which is uncontrolled. As a result, although in-sample methods address the Q-value extrapolation error, vanilla PI without constraints can still choose out-of-sample actions whose Q-values are very inaccurate.
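As a quick numerical sanity check on Theorem 3 (a toy sketch, not the paper's experiment: `w` stands in for the importance ratios and `g` for scalar per-transition TD gradients), both the IS estimator of Eqn. (6) and the reweighted SIR estimator $\bar{w}\hat{\Delta}_{SIR}$ agree with the full-dataset update in expectation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 1000, 64, 2000

# stand-ins: w_i plays the role of pi/beta ratios, g_i of scalar TD gradients
w = rng.lognormal(mean=0.0, sigma=1.0, size=n)
g = rng.normal(size=n)

target = np.mean(w * g)   # full-dataset expected update
rho = w / w.sum()         # SIR resampling probabilities
wbar = w.mean()           # average importance ratio (Theorem 3 correction)

is_ests, sir_ests = [], []
for _ in range(trials):
    idx = rng.integers(0, n, size=k)        # uniform mini-batch -> IS estimator
    is_ests.append(np.mean(w[idx] * g[idx]))
    idx = rng.choice(n, size=k, p=rho)      # resampled mini-batch -> SIR estimator
    sir_ests.append(wbar * np.mean(g[idx]))
```

Averaged over many mini-batches, both estimators converge to `target`; the $\bar{w}$ factor is exactly what removes the bias of plain resampling.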

4.3. LOWER VARIANCE

Theorem 3 shows that the proposed method provides unbiased policy evaluation with in-sample learning. In this section, we theoretically show that our SIR estimator (Eqn. (7)) has a lower variance than the IS estimator in many cases and thus yields a more stable learning process.

Proposition 4. Assume that the gradient is normalized. Then $\mathbb{V}[\bar{w}\,\hat{\Delta}_{SIR}] \leq \mathbb{V}[\hat{\Delta}_{IS}]$.

Proposition 4 indicates that when the scale of the per-sample gradient does not vary much across the dataset, there is a high probability that SIR has a smaller variance than IS.

Proposition 5. Assume that $\|\Delta(x)\|_2^2$ is positively correlated with $\frac{1}{\beta(a'|s')}$ for $x \in \mathcal{D}$ and that the policy $\pi$ is uniform. Then $\mathbb{V}[\bar{w}\,\hat{\Delta}_{SIR}] \leq \mathbb{V}[\hat{\Delta}_{IS}]$.

In general, a sample with a large behavior density usually has a small-scale gradient after training, which corresponds to the assumption in Proposition 5.

4.4. PRACTICAL ALGORITHM

Algorithm 1 IAC
Input: dataset $\mathcal{D} = \{(s, a, r, s', a')\}$.
Initialize behavior estimate $\beta_\omega$, policy network $\pi_\phi$, Q-network $Q_\theta$, and target Q-network $Q_{\theta'}$.
Pretrain $\beta_\omega$ by maximum likelihood on $\mathcal{D}$ (Eqn. (11)).
for each training step do
    Sample a mini-batch from the SumTree with probabilities $\rho \propto \pi_\phi(a'|s')/\beta_\omega(a'|s')$.
    Update $Q_\theta$ with the in-sample TD update $\hat{\Delta}_{SIR}$ (Eqn. (7)).
    Update $\pi_\phi$ with advantage-weighted regression (Eqn. (8)) on the same mini-batch.
    Refresh the resampling weights of the sampled transitions in the SumTree.
    Update the target network: $\theta' \leftarrow (1 - \tau)\theta' + \tau\theta$.
end for

IAC is the first practical algorithm to adapt the dataset distribution to the learned policy, and we keep the algorithm as simple as possible so that complex modules do not obscure our algorithm's impact on the final performance.

Density Estimator. IAC requires the behavior density $\beta$ as the denominator of the importance resampling weight (Eqn. (7)). We learn a parametric estimator of the behavior policy by maximum likelihood estimation. The estimated behavior policy $\beta_\omega$ is parameterized as a Gaussian distribution, with the objective
$$\max_\omega\; \mathbb{E}_{(s,a)\sim\mathcal{D}}[\log \beta_\omega(a|s)], \quad (11)$$
where $\omega$ is the parameter of the estimated behavior policy.

Policy Evaluation. In the policy evaluation phase, IAC uses non-uniformly sampled SARSA (sampling proportional to $\rho(s', a')$) to evaluate the trained policy. We represent the policy as a Gaussian distribution for its simple form of density.

Policy Improvement. In the policy improvement phase, Eqn. (8) requires calculating the expectation of the Q-value with respect to the current policy. We find that replacing the expectation with the Q-value of the policy's mean already obtains good performance. It also simplifies training by avoiding learning a V-function.

SumTree. The importance resampling weight $\rho$ is determined by $\pi(a'|s')$ and $\beta_\omega(a'|s')$. While $\beta_\omega(a'|s')$ is fixed after pretraining, $\pi(a'|s')$ changes as $\pi$ updates during training. We adopt the SumTree data structure to efficiently update $\rho$ during training and to sample proportionally to $\rho$. This is similar to prioritized experience replay (PER) (Schaul et al., 2015). In PER, $\rho$ is implemented as the transition-wise Bellman error, and sampling proportional to $\rho$ replays important transitions more frequently so that the Q-value is learned more efficiently. In our algorithm, $\rho$ is instead the importance resampling weight, and sampling proportional to $\rho$ provides an unbiased, in-sample way to evaluate any support-constrained policy.

Overall algorithm. Putting everything together, we summarize our final algorithm in Algorithm 1. It first trains the estimated behavior policy using Eqn. (11) to obtain the behavior density, and then turns to the actor-critic framework for policy training.
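A SumTree supporting $O(\log n)$ priority updates and proportional sampling can be sketched as follows (a minimal version assuming the capacity is a power of two; production PER-style implementations add capacity rounding and vectorized batch sampling):

```python
import numpy as np

class SumTree:
    """Binary sum-tree: leaf i holds a priority, internal nodes hold subtree sums.

    Both update() and sample() walk one root-to-leaf path, so each is O(log n).
    """
    def __init__(self, n):
        self.n = n                      # capacity, assumed a power of two
        self.tree = np.zeros(2 * n)     # leaves live at indices [n, 2n)

    def update(self, i, priority):
        j = self.n + i
        self.tree[j] = priority
        j //= 2
        while j >= 1:                   # propagate the new sum up to the root
            self.tree[j] = self.tree[2 * j] + self.tree[2 * j + 1]
            j //= 2

    def sample(self, u):
        """Return the leaf index whose mass interval contains u in [0, total)."""
        j = 1
        while j < self.n:
            j *= 2                      # descend to the left child
            if u >= self.tree[j]:       # mass u falls in the right subtree
                u -= self.tree[j]
                j += 1
        return j - self.n

tree = SumTree(4)
for i, pr in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, pr)
```

Drawing `u` uniformly from `[0, tree.tree[1])` then yields leaf `i` with probability proportional to its priority, which is exactly the proportional sampling the resampling weights require.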

5.1. ONE-STEP AND MULTI-STEP DYNAMIC PROGRAMMING

The most significant advantage of one-step approaches (Brandfonbrener et al., 2021) is that value estimation is completely in-sample and thus more accurate than in multi-step dynamic programming approaches, which propagate and magnify estimation errors. On the other hand, multi-step dynamic programming approaches can also propagate useful signals, which is essential for challenging tasks or low-performance datasets. IAC belongs to multi-step dynamic programming yet enjoys the benefit of one-step approaches. To show the relationship between IAC and one-step approaches, we define a general SIR estimator:
$$\hat{\Delta}^{\eta}_{SIR} = \frac{1}{k}\sum_{j=1}^{k} \Delta(\tilde{x}_j), \quad \tilde{x}_j \overset{\rho}{\sim} \{x_1, \ldots, x_n\} \text{ with probability } \rho_j = \frac{w(s'_j, a'_j)^{\eta}}{\sum_{i=1}^{n} w(s'_i, a'_i)^{\eta}}.$$
Note that IAC corresponds to $\eta = 1$, while the estimator reduces to OneStep RL (Brandfonbrener et al., 2021) when $\eta = 0$. We have the following result about $\hat{\Delta}^{\eta}_{SIR}$.

Proposition 6. Assume that $\Delta(x) = h$ for all $x \in \mathcal{D}$, where $h$ is a constant vector. Let $\eta \in [0, 1]$ and let $\bar{w}^{\eta}$ denote $\frac{1}{n}\sum_{j=1}^{n} w(s_j, a_j)^{\eta}$. Assume that $\sum_{j=1}^{n} w(s_j, a_j) \geq n$. Then $\mathbb{V}[\bar{w}^{\eta}\hat{\Delta}^{\eta}_{SIR}] \leq \mathbb{V}[\bar{w}\,\hat{\Delta}_{SIR}]$.

This indicates that $\eta < 1$ may bring a smaller variance when $\Delta(x)$ is the same for all $x \in \mathcal{D}$. However, this rarely holds in practice, and $\eta < 1$ introduces a bias when $\Delta(x)$ varies across the dataset. In our experiments, we show that choosing $\hat{\Delta}^{1}_{SIR}$ performs better than choosing $\hat{\Delta}^{0}_{SIR}$, which indicates that reducing the bias matters for resampling.
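The interpolation between OneStep RL and IAC amounts to a single exponent on the resampling weights; a sketch (with made-up importance ratios):

```python
import numpy as np

def resampling_probs(w, eta):
    """rho_j proportional to w_j**eta: eta = 0 recovers uniform sampling
    (OneStep RL), eta = 1 recovers full SIR resampling (IAC)."""
    p = np.asarray(w, dtype=float) ** eta
    return p / p.sum()

# hypothetical importance ratios pi(a'|s') / beta(a'|s') for four transitions
w = np.array([0.5, 1.0, 2.0, 4.0])
uniform = resampling_probs(w, eta=0.0)   # OneStep RL: every transition equally likely
full    = resampling_probs(w, eta=1.0)   # IAC: probability proportional to w
```

Intermediate values of `eta` trade the variance reduction of Proposition 6 against the bias introduced when the gradients differ across transitions.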

5.2. OTHER CHOICES OF IS

Other than Eqn. (4), reweighting the gradient of the value function is an alternative way to utilize IS. The TD update of the value function can be written as
$$\Delta_{TD} = \mathbb{E}_{(s,a,s')\sim \mathcal{D}}\big[w(s, a)\,\nabla_\theta V_\theta(s)(V_\theta(s) - r(s, a) - \gamma V_\theta(s'))\big].$$
We point out that a Q-value function, learned via the Bellman update, is still required to extract a policy. This implementation increases computational complexity compared to IAC. In addition, learning three components simultaneously complicates the training process.

6. EXPERIMENTS

In this section, we conduct several experiments to justify the validity of our proposed method. We aim to answer four questions: (1) Does SIR have a smaller variance than IS? (2) Does our method actually have a small extrapolation error? (3) Does our method perform better than previous methods on standard offline MuJoCo benchmarks? (4) How does each component of IAC contribute to our proposed method?

6.1. VARIANCE

We first test the variance of SIR and IS on a two-arm bandit task. The reward distributions of the first and second actions are $N(-1, 1)$ and $N(1, 1)$, respectively. We fix the dataset size at 100,000 and vary the ratio of the two actions' samples in the dataset to simulate a set of behavior policies. For a policy that chooses the two arms with identical probability, we evaluate the policy with SIR and IS under each behavior policy in the set. Fig. 2 shows the variance of SIR and IS for the different behavior policies. According to Fig. 2, SIR has a smaller variance than IS, especially when the dataset is highly imbalanced.
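The bandit setup can be reproduced with a short simulation (a sketch; the batch size and trial count are our own choices, as the section does not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)

def bandit_variances(p_arm0, n=100_000, k=512, trials=500):
    """Variance of IS vs. SIR estimates of V(pi) for pi = (0.5, 0.5),
    with data logged by behavior policy beta = (p_arm0, 1 - p_arm0)."""
    a = (rng.random(n) >= p_arm0).astype(int)         # logged arms
    r = rng.normal(np.where(a == 0, -1.0, 1.0), 1.0)  # arm 0: N(-1,1), arm 1: N(1,1)
    w = 0.5 / np.where(a == 0, p_arm0, 1.0 - p_arm0)  # pi(a) / beta(a)
    rho, wbar = w / w.sum(), w.mean()
    is_ests, sir_ests = [], []
    for _ in range(trials):
        idx = rng.integers(0, n, size=k)              # IS: uniform batch, reweight
        is_ests.append(np.mean(w[idx] * r[idx]))
        idx = rng.choice(n, size=k, p=rho)            # SIR: resample, no reweight
        sir_ests.append(wbar * np.mean(r[idx]))
    return np.var(is_ests), np.var(sir_ests)

var_is, var_sir = bandit_variances(p_arm0=0.9)        # highly imbalanced dataset
```

With the logging policy heavily skewed towards the worse arm, the rare arm receives a large IS ratio, inflating the IS variance, while SIR's resampled batches keep per-sample magnitudes bounded.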

6.2. EXTRAPOLATION ERROR

We compare IAC to a baseline that replaces the resampling component with general off-policy evaluation, updated by Eqn. (1). The policy updates and hyperparameters of IAC and the baseline are identical. Although advantage-weighted regression enforces a constraint on the policy, the baseline may still encounter target actions outside the dataset. We experiment with both methods on four tasks in D4RL (Fu et al., 2020). We plot the learned Q-values of IAC and the baseline in Fig. 3. We also show the true Q-value of IAC by rolling out the trained policy for 1,000 episodes and evaluating the Monte Carlo return. The result shows that the learned Q-value of IAC is close to the true Q-value. Note that the learned Q-value is smaller than the true Q-value on the walker2d-medium and halfcheetah-random tasks; the reason is that taking the minimum of two target networks leads to underestimation. By contrast, the Q-value of the baseline increases quickly and is far larger than that of IAC. This indicates that our proposed method has lower extrapolation error by referring only to target actions in the dataset.

6.3. COMPARISONS ON OFFLINE RL BENCHMARKS

Gym locomotion tasks. We evaluate our proposed approach on the D4RL benchmark (Fu et al., 2020) in comparison to prior methods (Table 1). We focus on the Gym-MuJoCo locomotion domains, involving three agents: halfcheetah, hopper, and walker2d. For each agent, five datasets are provided, corresponding to behavior policies of different qualities: random, medium, medium-replay, medium-expert, and expert.

AntMaze tasks. We also compare our proposed method with prior methods in the challenging AntMaze domains, which consist of sparse-reward tasks and require "stitching" fragments of suboptimal trajectories traveling undirectedly to find a path from the start of the maze to the goal. The results are shown in Table 2.

Baselines. Our offline RL baselines include both multi-step dynamic programming and one-step approaches. For the former, we compare to CQL (Kumar et al., 2020), TD3+BC (Fujimoto & Gu, 2021), and IQL (Kostrikov et al., 2022). For the latter, we compare to OneStep RL (Brandfonbrener et al., 2021).

Comparison with the one-step method. Note that the one-step method corresponds to $\hat{\Delta}^{0}_{SIR}$, which samples uniformly. The AntMaze tasks, especially the medium and large ones, contain few near-optimal trajectories, and the reward signal is sparse. These domains require "stitching" parts of suboptimal trajectories to find a path from the start of the maze to the goal (Fu et al., 2020). Therefore, one-step approaches perform poorly in these challenging domains, where multi-step dynamic programming is essential. We point out that using $\hat{\Delta}^{1}_{SIR}$ gives IAC the power of multi-step dynamic programming. At the same time, inheriting the advantages of one-step approaches, IAC uses in-sample data and thus has low extrapolation error. As shown in Fig. 4, our proposed method performs much better than the $\eta = 0$ choice, which corresponds to OneStep RL.

Comparison with importance sampling. We refer to the algorithm that updates the Q-value via importance sampling (see Eqn. (4)) as IAC-IS, and we test its performance on the MuJoCo and AntMaze tasks. The results are shown in Table 1, Table 2, and Fig. 4. IAC-IS performs slightly worse than IAC on the Gym locomotion tasks. For the challenging AntMaze tasks, there is a large gap between the two algorithms; IAC-IS even obtains zero reward on half of the tasks. The reason might be that IAC-IS has a larger variance than IAC, which impairs the learning process.

Ablation on the estimated behavior policy. IAC requires access to a pretrained behavior policy, which brings a computational burden. Removing the behavior policy and treating the behavior density as a constant introduces a bias but reduces the computational load. We refer to this variant as IAC-w/o-β. As shown in Table 1 and Table 2, IAC-w/o-β still obtains desirable performance on most Gym locomotion tasks and several AntMaze tasks. Thus, IAC-w/o-β is an appropriate choice for its lightweight property when computational complexity is a priority.

7. CONCLUSION

In this paper, we propose IAC, which conducts in-sample learning via sampling-importance resampling. IAC enjoys the benefits of both multi-step dynamic programming and in-sample learning, relying only on the target Q-values of the actions in the dataset. IAC is unbiased and has a smaller variance than importance sampling in many cases. In addition, IAC is the first method to dynamically adapt the dataset's distribution to match the trained policy during learning. The experimental results show the effectiveness of our proposed method. In future work, we expect to find better behavior policy estimators, such as transformers (Vaswani et al., 2017), to further boost our method.

A SAMPLING METHODS

The problem is to find $\mu = \mathbb{E}_p[f(x)] = \int_D f(x)p(x)\,dx$, where $p$ is the target distribution on $D \subseteq \mathbb{R}^d$, when we are only allowed to sample from some proposal distribution $q$ on $D$. Define the importance ratio $w(x) = p(x)/q(x)$.

A.1 IMPORTANCE SAMPLING

Condition: $q(x) > 0$ whenever $f(x)p(x) \neq 0$, i.e., $\mathrm{supp}(q) \supseteq \mathrm{supp}(p \cdot f)$, and $\mathbb{E}_q|w(x)f(x)| < +\infty$. Then
$$\mathbb{E}_p[f(x)] = \int_{\mathrm{supp}(q)} q(x)\frac{p(x)}{q(x)}f(x)\,dx = \mathbb{E}_q[f(x)w(x)].$$
The IS estimator of $\mu$ is
$$\hat{\mu}_{IS} = \frac{1}{n}\sum_{i=1}^{n} f(x_i)w(x_i), \quad x_i \sim q.$$
Bias: $\mathbb{E}_q(\hat{\mu}_{IS}) = \mu$, i.e., the estimator is unbiased. Variance:
$$\mathrm{Var}_q(\hat{\mu}_{IS}) = \frac{\mathrm{Var}_q(f(x)w(x))}{n} = \frac{1}{n}\left[\int_D \frac{(f(x)p(x))^2}{q(x)}\,dx - \mu^2\right] = \frac{1}{n}\int_D \frac{(f(x)p(x) - \mu q(x))^2}{q(x)}\,dx.$$
How should one select a good proposal $q$? The numerator is small when $f(x)p(x) - \mu q(x)$ is close to zero, that is, when $q(x)$ is nearly proportional to $f(x)p(x)$. From the denominator, we see that regions with small values of $q(x)$ greatly magnify whatever lack of proportionality appears in the numerator.

Theorem 7 (Optimality Theorem). For fixed $n$, the distribution $q$ that minimizes the variance of $\hat{\mu}_{IS}$ is
$$q^*(x) = \frac{|f(x)|p(x)}{\int |f(x)|p(x)\,dx} \propto |f(x)|p(x).$$

A.2 SAMPLING-IMPORTANCE RESAMPLING

Step 1. (Sampling) Draw an independent random sample $\{x_1, \ldots, x_n\}$ from the proposal distribution $q$.

Step 2. (Importance) Calculate the importance ratio for each $x_i$: $w(x_i) = p(x_i)/q(x_i)$.

Step 3. (Resampling) Draw $x^*$ from the discrete distribution over $\{x_1, \ldots, x_n\}$ with sampling probabilities $\rho(x_i) = w_i/\sum_{j=1}^{n} w_j$.
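Theorem 7 can be illustrated numerically (a toy sketch with our own choices $f(x) = x^2$ and $p = N(0, 1)$, so $\mu = 1$; a wider Gaussian proposal is closer to $|f|p$ than $p$ itself and should reduce the IS variance):

```python
import numpy as np

rng = np.random.default_rng(0)
pdf = lambda x, s: np.exp(-x**2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))
f = lambda x: x**2                    # mu = E_p[f] = 1 for p = N(0, 1)

def is_estimates(sigma_q, n=2000, trials=400):
    """IS estimates of mu using proposal q = N(0, sigma_q)."""
    ests = []
    for _ in range(trials):
        x = rng.normal(0.0, sigma_q, size=n)
        w = pdf(x, 1.0) / pdf(x, sigma_q)   # importance ratios p/q
        ests.append(np.mean(w * f(x)))
    return np.array(ests)

plain = is_estimates(1.0)              # q = p: plain Monte Carlo
tuned = is_estimates(np.sqrt(3.0))     # q spread out towards |f| * p, per Theorem 7
```

Both proposals give unbiased estimates of $\mu = 1$, but the estimator built on the better-matched proposal has markedly lower variance, as the theorem predicts.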

A.3 BATCH SETTING

Question: if the problem is to compute $\mu = \mathbb{E}_p f(x)$, which is better: IS, SNIS, or SIR? Assume IS and SNIS use the sample $B = \{x_1, \ldots, x_n\}$, while SIR resamples $k$ items $b = \{\tilde{x}_1, \ldots, \tilde{x}_k\}$ from $B$ with probability $\rho_i$ proportional to $w_i$. For a fair comparison, we also consider the batch versions IS-b and SNIS-b, which resample $k$ items $b = \{\tilde{x}_1, \ldots, \tilde{x}_k\}$ from $B$ uniformly. The estimators are as follows:
$$\hat{\mu}_{IS} = \frac{1}{n}\sum_{i=1}^n f(x_i)w(x_i), \quad \hat{\mu}_{IS\text{-}b} = \frac{1}{k}\sum_{i=1}^k f(\tilde{x}_i)w(\tilde{x}_i), \quad \hat{\mu}_{SNIS} = \frac{\sum_{i=1}^n f(x_i)w(x_i)}{\sum_{i=1}^n w(x_i)}, \quad \hat{\mu}_{SNIS\text{-}b} = \frac{\sum_{i=1}^k f(\tilde{x}_i)w(\tilde{x}_i)}{\sum_{i=1}^k w(\tilde{x}_i)}, \quad \hat{\mu}_{SIR} = \frac{1}{k}\sum_{i=1}^k f(\tilde{x}_i).$$

Proposition 8. $\hat{\mu}_{SIR}$ has the same bias as $\hat{\mu}_{SNIS}$.

Proof.
$$\mathbb{E}_{B \sim q}\mathbb{E}_b[\hat{\mu}_{SIR}] = \mathbb{E}_{B \sim q}\mathbb{E}_b\Big[\frac{1}{k}\sum_{j=1}^k f(\tilde{x}_j)\Big] = \mathbb{E}_{B \sim q}[\mathbb{E}_b f(\tilde{x}_1)] = \mathbb{E}_{B \sim q}\Big[\sum_{i=1}^n \frac{w_i}{\sum_{j=1}^n w_j} f(x_i)\Big] = \mathbb{E}_{B \sim q}[\hat{\mu}_{SNIS}].$$

Note that if $p$ and $q$ are normalized, $\bar{w}\hat{\mu}_{SIR}$ is unbiased, where $\bar{w} = \frac{1}{n}\sum_{i=1}^n w_i$. The variances are
$$\mathrm{Var}(\hat{\mu}_{IS\text{-}b}\,|\,B) = \frac{1}{k}\Big[\frac{1}{n}\sum_{j=1}^n w(x_j)^2 \|f(x_j)\|_2^2 - \mu_B^\top \mu_B\Big], \qquad \mathrm{Var}(\bar{w}\hat{\mu}_{SIR}\,|\,B) = \frac{1}{k}\Big[\frac{\bar{w}}{n}\sum_{j=1}^n w_j \|f(x_j)\|_2^2 - \mu_B^\top \mu_B\Big].$$

Proof. Since we condition on the dataset $B$, the only source of randomness is the resampling mechanism. Each index is sampled independently, so we have
$$\mathrm{Var}(\bar{w}\hat{\mu}_{SIR}\,|\,B) = \frac{1}{k^2}\sum_{j=1}^k \mathrm{Var}(\bar{w}f(\tilde{x}_j)\,|\,B) = \frac{1}{k}\mathrm{Var}(\bar{w}f(\tilde{x}_1)\,|\,B),$$
and similarly $\mathrm{Var}(\hat{\mu}_{IS\text{-}b}\,|\,B) = \frac{1}{k}\mathrm{Var}(w(\tilde{x}_1)f(\tilde{x}_1)\,|\,B)$. We can further simplify these expressions.
For the IS-b estimator,
$$\begin{aligned}
\mathrm{Var}(\hat{\mu}_{IS\text{-}b}\,|\,B) &= \frac{1}{k}\mathrm{Var}(w(\tilde{x}_1)f(\tilde{x}_1)\,|\,B) \\
&= \frac{1}{k}\Big(\mathbb{E}[w(\tilde{x}_1)^2 f(\tilde{x}_1)^\top f(\tilde{x}_1)\,|\,B] - \mathbb{E}[w(\tilde{x}_1)f(\tilde{x}_1)\,|\,B]^\top \mathbb{E}[w(\tilde{x}_1)f(\tilde{x}_1)\,|\,B]\Big) \\
&= \frac{1}{k}\Big(\mathbb{E}[w(\tilde{x}_1)^2 \|f(\tilde{x}_1)\|_2^2\,|\,B] - \mu_B^\top \mu_B\Big) \quad \text{since } w(\tilde{x}_1)f(\tilde{x}_1)\,|\,B \text{ is unbiased for } \mu_B \\
&= \frac{1}{k}\Big[\frac{1}{n}\sum_{j=1}^n w(x_j)^2 \|f(x_j)\|_2^2 - \mu_B^\top \mu_B\Big].
\end{aligned}$$
For the SIR estimator, recalling that $\bar{w} = \frac{1}{n}\sum_{i=1}^n w_i$, we follow similar steps:
$$\begin{aligned}
\mathrm{Var}(\bar{w}\hat{\mu}_{SIR}\,|\,B) &= \frac{1}{k}\mathrm{Var}(\bar{w}f(\tilde{x}_1)\,|\,B) \\
&= \frac{1}{k}\Big(\mathbb{E}[\bar{w}^2 f(\tilde{x}_1)^\top f(\tilde{x}_1)\,|\,B] - \mathbb{E}[\bar{w}f(\tilde{x}_1)\,|\,B]^\top \mathbb{E}[\bar{w}f(\tilde{x}_1)\,|\,B]\Big) \\
&= \frac{1}{k}\Big(\mathbb{E}[\bar{w}^2 \|f(\tilde{x}_1)\|_2^2\,|\,B] - \mu_B^\top \mu_B\Big) \quad \text{since } \bar{w}f(\tilde{x}_1)\,|\,B \text{ is unbiased for } \mu_B \\
&= \frac{1}{k}\Big[\sum_{j=1}^n \frac{\bar{w}^2 w_j}{\sum_{i=1}^n w_i}\|f(x_j)\|_2^2 - \mu_B^\top \mu_B\Big] = \frac{1}{k}\Big[\frac{\bar{w}}{n}\sum_{j=1}^n w_j \|f(x_j)\|_2^2 - \mu_B^\top \mu_B\Big].
\end{aligned}$$
Under what condition does the SIR estimator have a lower variance than IS-b? Condition 1: $\|f(x_j)\|_2^2 > c/w_j$ for samples where $w_j \geq \bar{w}$, and $\|f(x_j)\|_2^2 < c/w_j$ for samples where $w_j < \bar{w}$, for some $c > 0$.
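A quick numerical check of this comparison can be sketched as follows. This is a hypothetical four-point discrete problem of our own choosing (sizes picked only for illustration), on which larger $\|f\|$ coincides with larger $w$, so Condition 1 holds and SIR should attain the lower variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete problem: target p, uniform proposal q over 4 points.
xs = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
mu = np.sum(p * xs)  # true E_p[x] = 2.0

n, k, trials = 256, 32, 2000
est_is_b, est_sir = [], []
for _ in range(trials):
    B = rng.choice(4, size=n, p=q)  # dataset B ~ q
    w, f = p[B] / q[B], xs[B]
    # IS-b: k items drawn uniformly from B, weighted by w.
    u = rng.choice(n, size=k)
    est_is_b.append(np.mean(w[u] * f[u]))
    # SIR: k items drawn with probability rho_i ∝ w_i, scaled by w_bar.
    s = rng.choice(n, size=k, p=w / w.sum())
    est_sir.append(w.mean() * np.mean(f[s]))

est_is_b, est_sir = np.array(est_is_b), np.array(est_sir)
```

Both estimators are centered on $\mu = 2.0$, while the empirical variance of the SIR estimates comes out noticeably smaller than that of IS-b on this example.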

B PROOFS

B.1 PROOF OF PROPOSITION 1

Sampling-importance resampling (SIR) aims at drawing a random sample from a target distribution $p$. Typically, SIR consists of three steps: Step 1. (Sampling) Draw an independent random sample $\{x_1, \ldots, x_n\}$ from the proposal distribution $q$. Step 2. (Importance) Calculate the importance ratio for each $x_i$: $w(x_i) = p(x_i)/q(x_i)$. Step 3. (Resampling) Draw $x^*$ from the discrete distribution over $\{x_1, \ldots, x_n\}$ with sampling probabilities $\rho(x_i) = w_i / \sum_{j=1}^n w_j$.

Proof. $x^*$ has cdf
$$\Pr(x^* \leq x_0) = \sum_{i=1}^n \rho_i \mathbb{I}[x_i \in (-\infty, x_0)] = \frac{\frac{1}{n}\sum_{i=1}^n w_i \mathbb{I}[x_i \in (-\infty, x_0)]}{\frac{1}{n}\sum_{i=1}^n w_i} \xrightarrow{n \to \infty} \frac{\mathbb{E}_q\, w(x)\mathbb{I}[x \in (-\infty, x_0)]}{\mathbb{E}_q\, w(x)} = \frac{\int_{-\infty}^{x_0} p(x)\,dx}{\int_{-\infty}^{\infty} p(x)\,dx}.$$
Note that even if $p$ and $q$ are unnormalized (but normalizable), this method still works. The sample size under SIR can be as large as desired. The less $p$ resembles $q$, the larger the sample size $n$ needs to be in order for the distribution of $x^*$ to approximate $p$ well.

B.2 PROOF OF THEOREM 3

Proof. Note that there are two sources of randomness in the estimator $\hat{\Delta}_{SIR}$. First, $D = \{x_1, \ldots, x_n\}$ is sampled i.i.d. according to $p_\beta$. Second, our method draws $\tilde{x}$ i.i.d. from the discrete distribution over $\{x_1, \ldots, x_n\}$ placing mass $\rho_i$ on $x_i$, forming a mini-batch $b = \{\tilde{x}_1, \ldots, \tilde{x}_k\}$. Then
$$\begin{aligned}
\mathbb{E}_{D \sim p_\beta}\mathbb{E}_{b \sim \rho}[\bar{w}\hat{\Delta}_{SIR}] &= \mathbb{E}_{D \sim p_\beta}\Big[\bar{w}\,\mathbb{E}_{b \sim \rho}\Big[\frac{1}{k}\sum_{j=1}^k \Delta(\tilde{x}_j)\Big]\Big] = \mathbb{E}_{D \sim p_\beta}[\bar{w}\,\mathbb{E}_{\tilde{x}_1 \sim \rho}\Delta(\tilde{x}_1)] = \mathbb{E}_{D \sim p_\beta}\Big[\bar{w}\sum_{i=1}^n \rho_i \Delta(x_i)\Big] \\
&= \mathbb{E}_{D \sim p_\beta}\Big[\bar{w}\sum_{i=1}^n \frac{w_i}{\sum_{j=1}^n w_j}\Delta(x_i)\Big] = \mathbb{E}_{D \sim p_\beta}\Big[\frac{1}{n}\sum_{i=1}^n w_i \Delta(x_i)\Big] = \mathbb{E}_{x \sim p_\beta}[w\Delta(x)] \\
&= \mathbb{E}_{x \sim p_\beta}\Big[\frac{\pi(a'|s')}{\beta(a'|s')}\Delta(x)\Big] = \mathbb{E}_{x \sim p_\beta}\Big[\frac{p_\pi(x)}{p_\beta(x)}\Delta(x)\Big] = \mathbb{E}_{x \sim p_\pi}\Delta(x) = \Delta_{TD},
\end{aligned}$$
where the fifth equality uses $\sum_{j=1}^n w_j = n\bar{w}$.

B.3 PROOF OF PROPOSITION 4

Proof. For IS, we have $\mathbb{V}[\hat{\Delta}_{IS}] = \frac{1}{k}\big[\frac{1}{n}\sum_{j=1}^n w(x_j)^2\|\Delta(x_j)\|_2^2 - \mu_B^\top \mu_B\big]$, where $k$ is the size of the batch and $\mu_B$ is the expectation of the estimator. Since SIR is unbiased, we have $\mathbb{V}[\bar{w}\hat{\Delta}_{SIR}] = \frac{1}{k}\big[\frac{\bar{w}}{n}\sum_{j=1}^n w_j\|\Delta(x_j)\|_2^2 - \mu_B^\top \mu_B\big]$ for SIR. Now the problem is to show that
$$\sum_{j=1}^n w(x_j)^2\|\Delta(x_j)\|_2^2 \geq \bar{w}\sum_{j=1}^n w(x_j)\|\Delta(x_j)\|_2^2 \quad (15)$$
$$\Leftrightarrow \sum_{j=1}^n \rho_j^2\|\Delta(x_j)\|_2^2 \geq \frac{1}{n}\sum_{j=1}^n \rho_j\|\Delta(x_j)\|_2^2, \quad (16)$$
where $\rho_j = \frac{w(x_j)}{\sum_{i=1}^n w(x_i)} = \frac{w(x_j)}{n\bar{w}}$. Assume that the normalized gradient is applied: $\|\Delta(x_j)\|_2^2 = 1$. Then according to the Cauchy-Schwarz inequality, Eqn. (16) holds:
$$n\sum_{j=1}^n \rho_j^2 = \Big(\sum_{j=1}^n 1^2\Big)\Big(\sum_{j=1}^n \rho_j^2\Big) \geq \Big(\sum_{j=1}^n 1 \times \rho_j\Big)^2 = 1 = \sum_{j=1}^n \rho_j.$$

B.4 PROOF OF PROPOSITION 5

Proof. Assume that $(x_1, \ldots, x_n)$ is in descending order in terms of $\|\Delta(x_i)\|_2^2$; otherwise we could rearrange the items. Consider that $\|\Delta(x)\|_2^2$ is positively related to $\frac{1}{\beta(a'|s')}$ for $x \in D$ when the policy $\pi$ is uniform. To simplify notation, we use $\|f(x)\|^2$ to denote $\|\Delta(x)\|_2^2$. We have
$$\|f(x_1)\|^2 \geq \|f(x_2)\|^2 \geq \cdots \geq \|f(x_n)\|^2$$
and hence
$$\rho_1 \geq \rho_2 \geq \cdots \geq \rho_n \quad (\text{or } w(x_1) \geq w(x_2) \geq \cdots \geq w(x_n)).$$
Denote by $j^*$ the index of the largest $\rho_j$ satisfying $\rho_j < \frac{1}{n}$, i.e., $j^* = \min\{j : \rho_j < \frac{1}{n}, \ 1 \leq j \leq n\}$. It is easy to show $j^* > 1$.
We have $\rho_j - \frac{1}{n} \geq 0$ when $1 \leq j \leq j^* - 1$ and $\rho_j - \frac{1}{n} < 0$ when $j^* \leq j \leq n$. Then
$$\begin{aligned}
\sum_{j=1}^n \rho_j \|f(x_j)\|^2 \Big(\rho_j - \frac{1}{n}\Big) &= \sum_{j=1}^{j^*-1} \rho_j \|f(x_j)\|^2\Big(\rho_j - \frac{1}{n}\Big) + \sum_{j=j^*}^n \rho_j \|f(x_j)\|^2\Big(\rho_j - \frac{1}{n}\Big) \\
&\geq \sum_{j=1}^{j^*-1} \rho_{j^*-1}\|f(x_{j^*-1})\|^2\Big(\rho_j - \frac{1}{n}\Big) + \sum_{j=j^*}^n \rho_j \|f(x_j)\|^2\Big(\rho_j - \frac{1}{n}\Big) \\
&= \rho_{j^*-1}\|f(x_{j^*-1})\|^2\Big(1 - \sum_{j=j^*}^n \rho_j - \frac{j^*-1}{n}\Big) + \sum_{j=j^*}^n \rho_j \|f(x_j)\|^2\Big(\rho_j - \frac{1}{n}\Big) \\
&= \sum_{j=j^*}^n \rho_{j^*-1}\|f(x_{j^*-1})\|^2\Big(\frac{1}{n} - \rho_j\Big) + \sum_{j=j^*}^n \rho_j\|f(x_j)\|^2\Big(\rho_j - \frac{1}{n}\Big) \\
&= \sum_{j=j^*}^n \Big(\rho_{j^*-1}\|f(x_{j^*-1})\|^2 - \rho_j\|f(x_j)\|^2\Big)\Big(\frac{1}{n} - \rho_j\Big) \geq 0.
\end{aligned}$$
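The resampled update that Theorem 3 analyzes can be sketched numerically. The following is a hypothetical NumPy illustration (function and variable names are ours, not from the paper's code): conditioned on a synthetic dataset, averaging the resampled estimate over many mini-batches recovers the in-sample importance-weighted mean $\frac{1}{n}\sum_i w_i \Delta(x_i)$.

```python
import numpy as np

rng = np.random.default_rng(2)

def resampled_td_estimate(td_errors, pi_probs, beta_probs, k, rng):
    """One mini-batch of the SIR-based estimator: resample k indices with
    probability rho_i ∝ w_i = pi(a'|s') / beta(a'|s'), then scale the
    mini-batch mean of TD errors by the average ratio w_bar."""
    w = pi_probs / beta_probs
    idx = rng.choice(len(td_errors), size=k, p=w / w.sum(), replace=True)
    return w.mean() * td_errors[idx].mean()

# Synthetic dataset of n transitions with known ratios and TD errors.
n, k = 100, 16
pi_probs = rng.uniform(0.2, 1.0, size=n)
beta_probs = rng.uniform(0.2, 1.0, size=n)
delta = rng.uniform(-1.0, 1.0, size=n)

target = np.mean((pi_probs / beta_probs) * delta)  # (1/n) sum_i w_i Delta_i
draws = [resampled_td_estimate(delta, pi_probs, beta_probs, k, rng)
         for _ in range(4000)]
```

The mean of `draws` converges to `target`, matching the cancellation of $\bar{w}$ against $\rho_i = w_i/(n\bar{w})$ in the proof above.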

B.5 PROOF OF PROPOSITION 6

Assume that $\forall x \in D$, $\Delta(x) = h$, where $h$ is a constant vector. Let $\eta \in [0,1]$ and let $\bar{w}_\eta$ denote $\frac{1}{n}\sum_{j=1}^n w(s_j, a_j)^\eta$. Assume that $\sum_{j=1}^n w(s_j, a_j) \geq n$. Then the following holds: $\mathbb{V}[\bar{w}_\eta \hat{\Delta}^\eta_{SIR}] \leq \mathbb{V}[\bar{w}\hat{\Delta}_{SIR}]$.

Proof. We have
$$\mathbb{V}[\bar{w}\hat{\Delta}_{SIR}] = \frac{1}{k}\frac{\bar{w}}{n}\sum_{j=1}^n w_j \|\Delta(x_j)\|_2^2, \qquad \mathbb{V}[\bar{w}_\eta \hat{\Delta}^\eta_{SIR}] = \frac{1}{k}\frac{\bar{w}_\eta}{n}\sum_{j=1}^n w_j^\eta \|\Delta(x_j)\|_2^2.$$
When the normalized gradient is applied, the problem is to show that
$$\bar{w}\sum_{j=1}^n w_j \geq \bar{w}_\eta \sum_{j=1}^n w_j^\eta \ \Leftrightarrow \ \sum_{j=1}^n w_j \geq \sum_{j=1}^n w_j^\eta.$$
According to Hölder's inequality, we have
$$\Big(\sum_{j=1}^n x_j^\alpha\Big)^{\frac{1}{\alpha}}\Big(\sum_{j=1}^n 1^{\frac{\alpha}{\alpha-1}}\Big)^{1-\frac{1}{\alpha}} \geq \sum_{j=1}^n x_j \ \Leftrightarrow \ \Big(\sum_{j=1}^n x_j^\alpha\Big)^{\frac{1}{\alpha}} n^{1-\frac{1}{\alpha}} \geq \sum_{j=1}^n x_j.$$
By choosing $x_j = w_j^\eta$ and $\alpha = \frac{1}{\eta}$, we obtain $\big(\sum_{j=1}^n w_j\big)^\eta n^{1-\eta} \geq \sum_{j=1}^n w_j^\eta$. Since $\sum_{j=1}^n w_j \geq n$, it follows that $\sum_{j=1}^n w_j = \big(\sum_{j=1}^n w_j\big)^\eta \big(\sum_{j=1}^n w_j\big)^{1-\eta} \geq \big(\sum_{j=1}^n w_j\big)^\eta n^{1-\eta} \geq \sum_{j=1}^n w_j^\eta$.

For the MuJoCo locomotion tasks, we average returns over 10 evaluation trajectories and 5 random seeds, while for the AntMaze tasks, we average over 100 evaluation trajectories and 5 random seeds. Following the suggestions of the authors of the dataset, we subtract 1 from the rewards for the AntMaze datasets.

Modeling the behavior policy as a unimodal Gaussian distribution limits its flexibility and representation ability, so we consider capturing multiple modes of the behavior policy. To that end, we experiment with modeling the action space as discrete. Considering that the action range is [-1, 1], we split each action dimension into 40 categories, each covering a range of 0.05. The behavior policy is then estimated by cross-entropy. In this setting, the behavior policy is multi-modal; we term this variant IAC-MM. We compare this variant with IAC, and the results are shown in Table 4 and Table 5. The results show that the variant has marginally better performance in several settings, but its overall performance is worse than IAC's. In the AntMaze tasks especially, IAC-MM suffers a performance drop. The reason might be that the classification task ignores the relation between nearby actions.
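The discretization described above can be sketched as follows (a minimal illustration of the binning scheme; the helper name is ours): each action dimension in $[-1, 1]$ is mapped to one of 40 integer class labels for cross-entropy training.

```python
import numpy as np

def action_to_bin(actions, n_bins=40, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to integer class labels,
    per action dimension; with 40 bins each category spans 0.05."""
    width = (high - low) / n_bins
    labels = np.floor((actions - low) / width).astype(int)
    # Fold the boundary value a == high into the last bin.
    return np.clip(labels, 0, n_bins - 1)

labels = action_to_bin(np.array([-1.0, -0.51, 0.26, 0.999, 1.0]))
```

Behavior cloning then becomes a 40-way classification problem per action dimension, trained with a cross-entropy loss on these labels.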

E DISCRETE DOMAIN

We also test IAC and IAC-w/o-β on the CartPole task, which has a discrete action space. The dataset contains the samples from the replay buffer of a discrete SAC (soft actor-critic) agent trained until convergence. The result is shown in Fig. 7. IAC-w/o-β has a much worse final performance than IAC, and its maximum performance during training is also worse than IAC's.

F OTHER BENCHMARKS

To make a comprehensive comparison, we also compare AWAC and CRR with IAC. The results on the MuJoCo locomotion and AntMaze tasks are shown in Table 6 and Table 7, respectively. The results show that our method performs better than these baselines.



We omit the learning rate for simplicity.

$$\bar{w} \approx \mathbb{E}_{s \sim d_\beta(s),\, a \sim \beta(a|s)}\Big[\frac{\pi(a|s)}{\beta(a|s)}\Big] = \sum_{s,a} \frac{\pi(a|s)}{\beta(a|s)}\,\beta(a|s)\,d_\beta(s) = \sum_s d_\beta(s)\sum_a \pi(a|s) = 1.$$



Figure 2: SIR has a smaller variance than IS on a twoarm bandit task.

Figure 3: True Q-value of IAC, learned Q-values of IAC and a baseline without resampling.

Figure 4: Comparison with OneStep RL (η = 0) and importance sampling.

Proposition 9. $\hat{\mu}_{SIR}$ is consistent as $n \to \infty$.

For variance, we compare the unbiased $\bar{w}\hat{\mu}_{SIR}$ and $\hat{\mu}_{IS\text{-}b}$. On the one hand, $\hat{\mu}_{IS}$ and $\hat{\mu}_{SNIS}$ use the entire dataset $B$, which has many more items than the batch $b$, and should therefore yield lower-variance estimates. On the other hand, $\hat{\mu}_{SNIS\text{-}b}$ is biased, and if $p$ and $q$ are normalized, the bias-corrected version of $\hat{\mu}_{SNIS\text{-}b}$ is just $\hat{\mu}_{IS\text{-}b}$.

Proposition 10. For a fixed $B$, let $\mu_B = \mathbb{E}_p[f(x)|B]$. The variances of $\bar{w}\hat{\mu}_{SIR}$ and $\hat{\mu}_{IS\text{-}b}$ are as follows.

Figure 6: Learning Curves of IAC on AntMaze Tasks.

Figure 7: Learning Curves of CartPole task.

Figure 8: The effect of η on IAC.

Averaged normalized scores on MuJoCo locomotion on five seeds. Note that m=medium, m-r=medium-replay, r=random, m-e=medium-expert, and e=expert.

Averaged normalized scores on AntMaze on five seeds. Note that u=Umaze, u-d=Umaze-diverse, m-p=medium-play, m-d=medium-diverse, l-p=large-play, and l-d=large-diverse. Comparison with baselines: on the Gym locomotion tasks, we find that IAC outperforms prior methods. On the more challenging AntMaze tasks, IAC performs comparably to IQL and outperforms OneStep RL by a large margin.

Hyperparameters of policy training in IAC.

Averaged normalized scores on MuJoCo locomotion on five seeds. We compare IAC with IAC-MM (IAC with a multi-modal behavior policy), IAC-SR (IAC with a state ratio), IAC-VAE, and IAC-SNIS. We term the variant with a VAE-estimated behavior density IAC-VAE. The results of IAC-VAE are shown in Table 4 and Table 5. Benefiting from the VAE estimator, IAC-VAE obtains better results on the MuJoCo locomotion tasks.

Averaged normalized scores on AntMaze on five seeds. We compare IAC with IAC-MM, IAC-SR, IAC-VAE, and IAC-SNIS. Note that u=Umaze, u-d=Umaze-diverse, m-p=medium-play, m-d=medium-diverse, l-p=large-play, and l-d=large-diverse.

Averaged normalized scores on MuJoCo locomotion on five seeds. In addition to the baselines above, we compare with AWAC and CRR. Note that m=medium, m-r=medium-replay, r=random, m-e=medium-expert, and e=expert.

Runtime of TD3BC, CQL, IQL, and IAC for halfcheetah-medium-replay on a GeForce RTX 3090. We test the runtime of IAC on halfcheetah-medium-replay on a GeForce RTX 3090. The results of IAC and the other baselines are shown in Table 8. It takes 2h30min for IAC to finish the task, which is comparable to the other baselines. Note that the pre-training part takes only two minutes.


Published as a conference paper at ICLR 2023

We choose TD3 (Fujimoto et al., 2018) as our base algorithm and optimize a deterministic policy. To compute the SIR/IS ratio, we need the density of any action under the deterministic policy. For this, we assume all policies are Gaussian with a fixed variance of 0.1. Note that IAC has no additional hyperparameter to tune. The only hyperparameter we tuned is the inverse temperature $\beta$ in AWR for PI. We use $\beta = 10$ for the AntMaze tasks and $\beta \in \{0.25, 5\}$ for the MuJoCo locomotion tasks ($\beta = 0.25$ for the expert and medium-expert datasets, $\beta = 5$ for the medium, medium-replay, and random datasets). Following previous work (Brandfonbrener et al., 2021), we clip exponentiated advantages to $(-\infty, 100]$. All hyperparameters are listed in Table 3.

In this section, we conduct an ablation study on the behavior policy. Like previous works (Fujimoto et al., 2019; Wu et al., 2022), we consider learning the behavior density $\beta$ explicitly using a conditional variational auto-encoder (Kingma & Welling, 2013; Sohn et al., 2015). Specifically, $\beta(a|s)$ can be approximated by a deep latent variable model $p_{\omega_1}(a|s) = \int p_{\omega_1}(a|z,s)\,p(z|s)\,dz$ with prior $p(z|s) = \mathcal{N}(0, I)$. Rather than computing $p_{\omega_1}(a|s)$ directly by marginalization, the VAE constructs a lower bound on the log-likelihood $\log p_{\omega_1}(a|s)$ by introducing an approximate posterior $q_{\omega_2}(z|a,s)$:
$$\log p_{\omega_1}(a|s) \geq \mathbb{E}_{q_{\omega_2}(z|a,s)}\Big[\log \frac{p_{\omega_1}(a,z|s)}{q_{\omega_2}(z|a,s)}\Big] = \mathbb{E}_{q_{\omega_2}(z|a,s)}[\log p_{\omega_1}(a|z,s)] - \mathrm{KL}[q_{\omega_2}(z|a,s)\,\|\,p(z|s)] \overset{\text{def}}{=} J_{ELBO}(s,a;\omega).$$


This converts a difficult computation problem into an optimization problem. Instead of maximizing the log-likelihood $\log p_{\omega_1}(a|s)$ directly, we optimize the parameters $\omega_1$ and $\omega_2$ jointly by maximizing the evidence lower bound (ELBO) $J_{ELBO}(s,a;\omega)$. After pre-training the VAE, we simply use $J_{ELBO}(s,a;\omega)$ to approximate $\log \beta(a|s)$ in Eqn. (7).

Using a state-distribution correction $\frac{d_\pi(s)}{d_\beta(s)}$ might be helpful to IAC. For most RL settings, the dimension of the state is larger than that of the action. Since high-dimensional density estimation is challenging, it is difficult to estimate $d_\pi(s)$ and $d_\beta(s)$ separately. Thus we resort to estimating the ratio $\frac{d_\pi(s)}{d_\beta(s)}$ directly, following the paper 'Infinite-Horizon Off-Policy Estimation'. We term this variant IAC-SR. The results are shown in Table 4 and Table 5; they indicate that introducing the state-distribution correction worsens performance. One reason is that the approximation of $\frac{d_\pi(s)}{d_\beta(s)}$ is not accurate, which introduces bias into the algorithm.
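The ELBO computation described in this section can be sketched for a diagonal-Gaussian posterior as follows. This is a minimal NumPy illustration with hypothetical helper names; the actual encoder/decoder networks that produce `mu`, `logvar`, and the reconstruction log-likelihood are omitted:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL[N(mu, diag(exp(logvar))) || N(0, I)], summed over
    latent dimensions -- the KL term of the ELBO with prior p(z|s) = N(0, I)."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def elbo(log_pa_given_zs, mu, logvar):
    """J_ELBO(s, a; omega) = E_q[log p(a|z,s)] - KL[q(z|a,s) || p(z|s)],
    with the expectation approximated by a single reconstruction sample."""
    return log_pa_given_zs - gaussian_kl(mu, logvar)
```

After pre-training, the value returned by `elbo` serves as the approximation to $\log \beta(a|s)$.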

G.2 ABLATION ON SELF-NORMALIZED IMPORTANCE SAMPLING

Considering that self-normalized importance sampling also has a lower variance than importance sampling, we test the performance of a variant with self-normalized importance sampling, termed IAC-SNIS. The results of IAC-SNIS are shown in Table 4 and Table 5. The self-normalized importance sampling variant performs comparably to IAC on the MuJoCo tasks but worse than IAC on the AntMaze tasks.

G.3 ABLATION ON η

To study the effect of the hyperparameter $\eta$ on our proposed method, we run experiments with $\eta \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. The results are shown in Fig. 8. It can be seen that the variants with a large $\eta$ perform better than those with a small $\eta$.

