THE IN-SAMPLE SOFTMAX FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our methods suffers from insufficient action coverage: the standard max operator may select a maximal action that has not been seen in the dataset. Bootstrapping from these inaccurate values can lead to overestimation and even divergence. A growing number of methods attempt to approximate an in-sample max, one that uses only actions well-covered by the dataset. We highlight a simple fact: it is more straightforward to approximate an in-sample softmax using only actions in the dataset. We show that policy iteration based on the in-sample softmax converges, and that for decreasing temperatures it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC) using this in-sample softmax, and show that it is consistently better than or comparable to existing offline RL methods, and is also well-suited to fine-tuning. We release the code at github.

1. INTRODUCTION

A common goal in reinforcement learning (RL) is to learn a control policy from data. In the offline setting, the agent has access to a batch of previously collected data. This data could have been gathered under a near-optimal behavior policy, from a mediocre policy, or from a mixture of different policies (perhaps produced by several human operators). A key challenge is to be robust to this data-gathering distribution, since in many application settings we do not control data collection. Most approaches in offline RL learn action-values, either through Q-learning updates, which bootstrap off of a maximal action in the next state, or through actor-critic algorithms, where the action-values are updated using temporal-difference (TD) learning to evaluate the actor. In either case, poor action coverage can interact badly with bootstrapping, yielding poor performance. Action-value updates based on TD involve bootstrapping off an estimate of the values in the next state. This bootstrapping is problematic if the value is an overestimate, which is likely to occur for actions that are never sampled in a state (Fujimoto et al., 2018; Kumar et al., 2019; Fujimoto et al., 2019). When using a maximum over actions, this overestimate will be selected, pushing up the value of the current state and action. Such updates can lead to poor policies and instability (Fujimoto et al., 2018; Kumar et al., 2019; Fujimoto et al., 2019). There are two main approaches in offline RL to handle this overestimation issue. One direction constrains the learned policy to be similar to the dataset policy (Wu et al., 2019; Peng et al., 2020; Nair et al., 2021; Brandfonbrener et al., 2021; Fujimoto & Gu, 2021). A related idea is to constrain the stationary distribution of the learned policy to be similar to the data distribution (Yang et al., 2022). The challenge with both of these approaches is that they rely on the dataset being generated by an expert or near-optimal policy.
When used on datasets from more suboptimal policies, like those commonly found in industry, they do not perform well (Kostrikov et al., 2022). The other approach is to bootstrap off pessimistic value estimates (Kidambi et al., 2020; Kumar et al., 2020; Kostrikov et al., 2021; Yu et al., 2021; Jin et al., 2021; Xiao et al., 2021) and, relatedly, to identify and reduce the influence of out-of-distribution actions using ensembles (Kumar et al., 2019; Agarwal et al., 2020; Ghasemipour et al., 2021; Wu et al., 2021; Yang et al., 2021; Bai et al., 2022). One simple strategy that has been more recently proposed is to constrain the set of actions considered for bootstrapping to the support of the dataset $D$. In other words, if $\pi_D(a|s)$ is the conditional action distribution underlying the dataset, then we use $\max_{a': \pi_D(a'|s') > 0} q(s', a')$ instead of $\max_{a'} q(s', a')$: a constrained or in-sample max. This idea was first introduced for Batch-Constrained Q-learning (BCQ) (Fujimoto et al., 2019) in the tabular setting, with a generative model used to approximate and sample $\pi_D(a|s)$ (Fujimoto et al., 2019; Zhou et al., 2020; Wu et al., 2022). Implicit Q-learning (IQL) (Kostrikov et al., 2022) was the first model-free approximation to use this in-sample max, with a later modification to be less conservative (Ma et al., 2022). IQL instead uses expectile regression to push the action-values to predict upper expectiles that are a (close) lower bound on the true maximum. The approach nicely avoids estimating $\pi_D$, and empirically performs well. Using only actions in the dataset is beneficial because it can be difficult to properly constrain the support of a learned model for $\pi_D$ and ensure it does not output out-of-distribution actions. There are, however, a few limitations to IQL. The IQL solution depends on the action distribution, not just its support.
In practice, we would expect IQL to perform poorly when the data distribution is skewed towards suboptimal actions in some states, pulling down the expectile regression targets. We find evidence for this in our experiments. Additionally, convergence is difficult to analyze because expectile regression does not have a closed-form solution. One recent work showed that the Bellman operator underlying an expectile value learning algorithm is a contraction, but only for the setting with deterministic transitions (Ma et al., 2022). In this work, we revisit how to directly use the in-sample max. Our key insight is simple: sampling under support constraints is more straightforward for the softmax, in the entropy-regularized setting. We first define the in-sample softmax and show that it maintains the same contraction and convergence properties as the standard softmax. Further, we show that with a decreasing temperature (entropy) parameter, the in-sample softmax approaches the in-sample max. This formulation, therefore, is useful both for those wishing to incorporate entropy regularization and for obtaining a reasonable approximation to the in-sample max by selecting a small temperature. We then show that we can obtain a policy update that relies primarily on sampling from the dataset (which is naturally in-sample) rather than requiring samples from an estimate of $\pi_D$. We conclude by showing that our resulting In-Sample Actor-Critic algorithm consistently outperforms or matches existing methods, despite being a notably simpler method, in offline RL experiments with and without fine-tuning.

2. PROBLEM SETTING

In this section we outline the key issue of action-coverage in offline RL that we address in this work.

2.1. MARKOV DECISION PROCESS

We consider a finite Markov Decision Process (MDP) determined by $M = \{S, A, P, r, \gamma\}$ (Puterman, 2014), where $S$ is a finite state space, $A$ is a finite action space, $\gamma \in [0, 1)$ is the discount factor, and $r : S \times A \to \mathbb{R}$ and $P : S \times A \to \Delta(S)$ are the reward and transition functions. The value function specifies the future discounted total reward obtained by following a policy $\pi : S \to \Delta(A)$, $v^\pi(s) = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0 = s]$, where we use $\mathbb{E}_\pi$ to denote the expectation under the distribution induced by the interconnection of $\pi$ and the environment. The corresponding action-value function is $q^\pi(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[v^\pi(s')]$. There exists an optimal policy $\pi^*$ that maximizes the values for all states $s \in S$. We use $v^*$ and $q^*$ to denote the optimal value functions. The optimal values satisfy the Bellman optimality equations,
$$v^*(s) = \max_a \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[v^*(s')] \right], \qquad q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\max_{a'} q^*(s', a')\right]. \quad (1)$$
In this work we more specifically consider the entropy-regularized MDP setting, also called the maximum-entropy setting, where an entropy term is added to the reward to encourage the policy to be stochastic. The maximum-entropy value function is defined as
$$\tilde{v}^\pi(s) = v^\pi(s) + \tau H(s, \pi), \qquad H(s, \pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty -\gamma^t \log \pi(a_t|s_t) \,\Big|\, s_0 = s\right],$$
for temperature $\tau$ and discounted entropy regularization $H$. The corresponding maximum-entropy action-value function is $\tilde{q}^\pi(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[\tilde{v}^\pi(s')]$, with the soft Bellman optimality equations similarly modified, as described in the next section. As $\tau \to 0$, we recover the original value function definitions.
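To make the log-sum-exp concrete, here is a minimal numerical sketch in plain Python (the values are illustrative, not from any experiment) of the soft state value $\tau \log \sum_a e^{q(s,a)/\tau}$ and its convergence to the hard max as $\tau \to 0$:

```python
import math

def softmax_value(q, tau):
    """Entropy-regularized (soft) value: tau * log sum_a exp(q(a)/tau).

    Computed with the standard max-shift for numerical stability."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))

q = [1.0, 2.0, 3.0]
for tau in [1.0, 0.1, 0.01]:
    print(tau, softmax_value(q, tau))
# As tau -> 0 the soft value approaches max(q) = 3.0 from above.
```

The soft value always upper-bounds the hard max (the entropy bonus is non-negative), and the gap vanishes as the temperature shrinks.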
The entropy-regularized setting has become widely used (Ziebart et al., 2008; Mnih et al., 2016; Nachum et al., 2017; Asadi & Littman, 2017; Haarnoja et al., 2018; Mei et al., 2019; Xiao et al., 2019) , because it 1) encourages exploration (Ziebart et al., 2008) , 2) often makes objectives more smooth (Mei et al., 2019) , and 3) provides these improvements even with small temperatures that do not significantly bias the solution to the original MDP (Song et al., 2019) .

2.2. OFFLINE REINFORCEMENT LEARNING

In this work, we consider the problem of learning an optimal decision-making policy from a previously collected offline dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=0}^{n-1}$. We assume that the data is generated by executing a behavior policy $\pi_D$. Note that we do not assume direct access to $\pi_D$. In offline RL, the learning algorithm can only learn from samples in this $D$, without further interaction with the environment. One primary issue in offline RL is that $\pi_D$ may not have full coverage over actions. Greedy decisions based on a learned value $q \approx q^*$ could be problematic, especially when the value is an overestimate for out-of-distribution actions (Fujimoto et al., 2019). To overcome this issue, one popular approach is to constrain the learned policy to be similar to $\pi_D$, such as by adding a KL-divergence term: $\max_\pi \mathbb{E}_{s \sim \rho}\left[\sum_a \pi(a|s) q(s, a) - \tau D_{KL}(\pi(\cdot|s) \| \pi_D(\cdot|s))\right]$ for some $\tau > 0$. The optimal policy for this objective must be on the support of $\pi_D$: the KL constraint ensures $\pi(a|s) = 0$ whenever $\pi_D(a|s) = 0$. This optimal policy, with closed-form solution $\pi'(a|s) \propto \pi_D(a|s) \exp(q(s, a)/\tau)$, is also guaranteed to be an improvement on $\pi_D$. Many offline RL algorithms are based on this nice idea (Wu et al., 2019; Peng et al., 2020; Nair et al., 2021; Brandfonbrener et al., 2021; Fujimoto & Gu, 2021). This KL constraint, however, can result in a poor $\pi$ when $\pi_D$ is suboptimal, confirmed both in previous studies (Kostrikov et al., 2022) and in our experimental results. The other strategy is to consider an in-sample policy optimization, $\max_{\pi \preceq \pi_D} \sum_{a \in A} \pi(a|s) q(s, a)$, where $\pi \preceq \pi_D$ indicates that the support of $\pi$ is a subset of the support of $\pi_D$. This approach more directly avoids selecting out-of-distribution actions. Though a simple idea, approximating this with a simple algorithm has been elusive, as discussed above.
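The pull of the behavior policy in the KL-regularized closed form $\pi'(a|s) \propto \pi_D(a|s)\exp(q(s,a)/\tau)$ can be seen in a few lines of Python (the action values and behavior probabilities below are made up for illustration):

```python
import math

def kl_regularized_policy(q, pi_d, tau):
    """Closed-form solution pi'(a|s) proportional to pi_D(a|s)*exp(q(s,a)/tau)
    of the KL-constrained objective (zero wherever pi_D is zero)."""
    w = [p * math.exp(x / tau) for x, p in zip(q, pi_d)]
    z = sum(w)
    return [x / z for x in w]

# A skewed behavior policy pulls probability toward its own favored action,
# even when that action has a lower value.
q = [1.0, 2.0]          # illustrative action values
pi_d = [0.9, 0.1]       # behavior policy heavily favors the worse action
print(kl_regularized_policy(q, pi_d, tau=1.0))
```

Here the resulting policy still places most of its mass on the lower-value action, which is precisely the failure mode when $\pi_D$ is suboptimal.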
The simplest idea is to estimate $\pi_\omega \approx \pi_D$ and directly constrain the support by sampling candidate actions from $\pi_\omega$, as proposed for Batch-Constrained Q-learning (Fujimoto et al., 2019). This simple approach, however, may not avoid bootstrapping from out-of-sample actions, due to error in the estimate $\pi_\omega$. Surprisingly, the small modification to the in-sample softmax (Section 3) has not yet been considered for offline RL. Yet moving from the in-sample (hard) max to the in-sample softmax facilitates developing a simple algorithm, as we discuss in the remainder of this work.

3. THE IN-SAMPLE SOFTMAX OPTIMALITY

This section introduces the in-sample softmax optimality, which provides a simple implementation of in-sample bootstrapping. We first describe the standard soft Bellman optimality equations, then the modification to consider in-sample bootstrapping. Our simple algorithm comes from stepping back and recognizing the utility of considering in-sample bootstrapping for the entropy-regularized setting rather than only for the hard max. The soft Bellman optimality equations for maximum-entropy RL use the softmax in place of the max,
$$\tilde{q}^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\tau \log \sum_{a' \in A} e^{\tilde{q}^*(s', a')/\tau}\right]. \quad (3)$$
This comes from the fact that the hard max with entropy regularization satisfies $\max_{p \in \Delta(A)} \sum_{a \in A} p(a) q(s, a) + \tau H(p) = \tau \log \sum_{a \in A} e^{q(s,a)/\tau}$. As $\tau \to 0$, the softmax (log-sum-exp) approaches the max. We can modify this update to restrict the softmax to the support of $\pi_D$:
$$\tilde{q}^*_{\pi_D}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\tau \log \sum_{a': \pi_D(a'|s') > 0} e^{\tilde{q}^*_{\pi_D}(s', a')/\tau}\right]. \quad (4)$$
We call Eq. (4) the in-sample softmax optimality equation. It is interesting to note that we can use a simple reformulation that facilitates sampling the inner term. For any $q$,
$$\sum_{a: \pi_D(a|s) > 0} e^{q(s,a)/\tau} = \sum_{a: \pi_D(a|s) > 0} \pi_D(a|s)\, \pi_D(a|s)^{-1} e^{q(s,a)/\tau} = \sum_{a: \pi_D(a|s) > 0} \pi_D(a|s)\, e^{-\log \pi_D(a|s)} e^{q(s,a)/\tau} = \mathbb{E}_{a \sim \pi_D(\cdot|s)}\left[e^{q(s,a)/\tau - \log \pi_D(a|s)}\right]. \quad (5)$$
This reformulation does not perfectly remove the role of $\pi_D(a|s)$, but its role is significantly reduced. The support is no longer constrained using $\pi_D$; instead, the values are simply shifted by this term involving $\pi_D$. We will use this strategy below to develop our algorithm. There are a few interesting facts to note about the in-sample softmax. First, we can show that, similarly to the standard maximum-entropy bootstrap (shown formally in Lemma 3), we have for any $q$
$$\tau \log \sum_{a: \pi_D(a|s) > 0} e^{q(s,a)/\tau} = \max_{\pi \preceq \pi_D} \sum_a \pi(a|s) q(s, a) + \tau H(\pi).$$
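The identity in Eq. (5) is easy to verify numerically; a small sketch in plain Python (values illustrative):

```python
import math

def support_sum(q, pi_d, tau):
    """Direct in-sample sum: sum over {a : pi_D(a)>0} of exp(q(a)/tau)."""
    return sum(math.exp(x / tau) for x, p in zip(q, pi_d) if p > 0)

def expectation_form(q, pi_d, tau):
    """Equivalent form E_{a~pi_D}[exp(q(a)/tau - log pi_D(a))], which can be
    estimated using only actions drawn from the dataset."""
    return sum(p * math.exp(x / tau - math.log(p))
               for x, p in zip(q, pi_d) if p > 0)

q = [1.0, 2.0, 5.0]
pi_d = [0.5, 0.5, 0.0]   # third action is out of support
print(support_sum(q, pi_d, 1.0), expectation_form(q, pi_d, 1.0))
# The two forms agree, and the out-of-support action never contributes.
```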
Though this outcome is intuitive, it is a nice property that restricting the support of the log-sum-exp maintains the same relationship to the maximum-entropy update with the same support constraint. It extends this result for the soft Bellman optimality update to the setting with a support constraint. From this perspective, the in-sample softmax can also be viewed as a tool for conservative exploration: exploring to prevent getting stuck in local optima, while still being suspicious of what the data does not know. This is especially important when $q$ is a learned value approximation. Second, we can also obtain a closed-form greedy policy using the above (shown formally in Lemma 3), which we call the in-sample softmax greedy policy: for any $q$,
$$\pi_{\pi_D, q}(a|s) \propto \pi_D(a|s) \exp\left(\frac{q(s,a)}{\tau} - \log \pi_D(a|s)\right).$$
This closed-form solution looks similar to the KL-regularized solution mentioned in Section 2.2, where $\pi$ is constrained to be similar to $\pi_D$. The only difference is the additional $-\log \pi_D$ term inside the exponential. This small difference, however, has a big impact. It allows the resulting policy to deviate much more from $\pi_D$. In fact, because $\exp(-\log \pi_D(a|s)) = \pi_D(a|s)^{-1}$, the above is equivalent to $\pi_{\pi_D, q}(a|s) = 0$ when $\pi_D(a|s) = 0$ and $\pi_{\pi_D, q}(a|s) \propto \exp(q(s, a)/\tau)$ otherwise. The new policy $\pi_{\pi_D, q}$ is not skewed by the action probabilities in $\pi_D$; it only shares its support.
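The claim that the closed form depends only on the support of $\pi_D$, not its probabilities, can be checked directly; a short sketch (illustrative values):

```python
import math

def in_sample_softmax_policy(q, pi_d, tau):
    """pi(a) proportional to pi_D(a)*exp(q(a)/tau - log pi_D(a)): zero off
    the support of pi_D, and a plain softmax of q on the support."""
    w = [p * math.exp(x / tau - math.log(p)) if p > 0 else 0.0
         for x, p in zip(q, pi_d)]
    z = sum(w)
    return [x / z for x in w]

q = [1.0, 2.0, 5.0]
# Two behavior policies with the same support but different probabilities
p1 = in_sample_softmax_policy(q, [0.9, 0.1, 0.0], tau=1.0)
p2 = in_sample_softmax_policy(q, [0.1, 0.9, 0.0], tau=1.0)
print(p1, p2)
# p1 == p2: only the support of pi_D matters, not its probabilities.
```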

4. THEORETICAL CHARACTERIZATION OF IN-SAMPLE SOFTMAX

In this section we prove that the in-sample softmax maintains the convergence properties of the standard softmax. In particular, Bellman updates with the in-sample softmax are convergent, and the resulting in-sample softmax optimal policy approaches the in-sample optimal policy as we reduce the temperature to zero. All proofs are given in Appendix A. We can contrast our in-sample softmax optimality equation in (4) with the in-sample Bellman optimality equation introduced by Fujimoto et al. (2019) for the hard max,
$$q^*_{\pi_D}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\max_{a': \pi_D(a'|s') > 0} q^*_{\pi_D}(s', a')\right].$$
We first show that the policy produced using the in-sample softmax optimality equation is a good approximation to that given by the in-sample Bellman optimality equation.

Theorem 1. Let $\tilde{q}^*_{\pi_D}$ be the in-sample softmax optimal value function. We have $\lim_{\tau \to 0} \tilde{q}^*_{\pi_D} = q^*_{\pi_D}$. Moreover, let $I$ be an indicator function and $\bar{\pi}_{\pi_D}(a|s) = I(a = \arg\max_{a: \pi_D(a|s) > 0} q^*_{\pi_D}(s, a))$ be the in-sample optimal policy w.r.t. $q^*_{\pi_D}$. Define the in-sample softmax optimal policy,
$$\pi^*_{\pi_D}(a|s) \propto \pi_D(a|s) \exp\left(\frac{\tilde{q}^*_{\pi_D}(s, a)}{\tau} - \log \pi_D(a|s)\right). \quad (6)$$
We have $\lim_{\tau \to 0} \pi^*_{\pi_D} = \bar{\pi}_{\pi_D}$.

Now we show that we can reach the in-sample softmax optimal solution using either value iteration or policy iteration. For value iteration, we define the in-sample softmax optimality operator
$$(T_{\pi_D} q)(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\tau \log \sum_{a': \pi_D(a'|s') > 0} e^{q(s', a')/\tau}\right].$$
The next result shows that $T_{\pi_D}$ is a contraction, and therefore in-sample soft value iteration, using $q_{t+1} = T_{\pi_D} q_t$, is guaranteed to converge to the in-sample softmax optimal value in the tabular case.

Theorem 2. For $\gamma < 1$, the fixed point of the in-sample softmax optimality operator exists and is unique. Thus, in-sample soft value iteration converges to the in-sample softmax optimal value $\tilde{q}^*_{\pi_D}$.
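A tabular sketch of in-sample soft value iteration, $q_{t+1} = T_{\pi_D} q_t$, assuming for simplicity deterministic transitions (the two-state MDP below is a made-up example, not from the paper):

```python
import math

def in_sample_soft_value_iteration(r, P, support, gamma, tau, iters=500):
    """Repeatedly apply the in-sample softmax optimality operator.

    r[s][a]: reward, P[s][a]: deterministic next state,
    support[s]: set of in-sample actions for state s."""
    nS, nA = len(r), len(r[0])
    q = [[0.0] * nA for _ in range(nS)]
    for _ in range(iters):
        new = [[0.0] * nA for _ in range(nS)]
        for s in range(nS):
            for a in range(nA):
                s2 = P[s][a]
                # bootstrap only from in-sample actions of the next state
                v = tau * math.log(sum(math.exp(q[s2][a2] / tau)
                                       for a2 in support[s2]))
                new[s][a] = r[s][a] + gamma * v
        q = new
    return q

r = [[0.0, 1.0], [0.0, 0.0]]   # hypothetical two-state MDP
P = [[0, 1], [1, 1]]           # deterministic successors
support = [{0, 1}, {0}]        # action 1 is out-of-sample in state 1
print(in_sample_soft_value_iteration(r, P, support, gamma=0.9, tau=0.01))
```

With a small temperature, the fixed point matches the in-sample hard-max values, and the out-of-sample action in state 1 never enters the bootstrap.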
As highlighted by Equation (6), the in-sample softmax policy corresponds to the solution of a maximum-entropy policy optimization. This implies that, similarly to Soft Actor-Critic (Haarnoja et al., 2018), we can apply policy iteration to find this policy. Let $\pi_t$ be the policy at iteration $t$. The algorithm first learns the value function $\tilde{q}^{\pi_t}$, then updates the policy to $\pi_{t+1}$ such that $\tilde{q}^{\pi_t} \le \tilde{q}^{\pi_{t+1}}$. The following result shows that this procedure guarantees policy improvement.

Lemma 1. Let $\pi_t$ be a policy such that $\pi_t \preceq \pi_D$. Define
$$\pi_{t+1}(a|s) \propto \pi_D(a|s) \exp\left(\frac{\tilde{q}^{\pi_t}(s, a)}{\tau} - \log \pi_D(a|s)\right). \quad (11)$$
Then $\pi_{t+1} \preceq \pi_D$ and $\tilde{q}^{\pi_{t+1}} \ge \tilde{q}^{\pi_t}$.

Note that $\pi_{t+1}$ not only ensures policy improvement, but also stays in the support of $\pi_D$. Now let us define the on-policy entropy-regularized operator,
$$(T^\pi q)(s, a) = r(s, a) + \gamma \mathbb{E}_{s', a' \sim P^\pi(\cdot|s,a)}\left[q(s', a') - \tau \log \pi(a'|s')\right].$$
Since this operator is a contraction (shown formally in Lemma 5), we can evaluate $\tilde{q}^\pi$ by repeatedly applying $T^\pi$ from any $q$ until convergence. These updates give rise to the in-sample soft policy iteration algorithm, which iteratively updates the policy using (11) and evaluates it using $T^\pi$. Convergence for the tabular case is given below.

Theorem 3. For $\gamma < 1$, starting from any initial policy $\pi$ such that $\pi \preceq \pi_D$, in-sample soft policy iteration converges to the in-sample softmax optimal policy $\pi^*_{\pi_D}$.
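The alternation in Theorem 3 can be sketched in the tabular, deterministic-transition case. This is a toy rendering on a made-up two-state MDP, using one evaluation sweep per improvement step, which suffices for the demo but is not the exact procedure analyzed in the theorem:

```python
import math

def soft_policy_iteration(r, P, pi_d, gamma, tau, iters=100):
    """Alternate one sweep of the on-policy entropy-regularized backup with
    the closed-form in-sample soft greedy improvement step."""
    nS, nA = len(r), len(r[0])
    pi = [row[:] for row in pi_d]   # start from pi_D, trivially in-support
    q = [[0.0] * nA for _ in range(nS)]
    for _ in range(iters):
        # policy evaluation: one sweep of T^pi
        for s in range(nS):
            for a in range(nA):
                s2 = P[s][a]
                v = sum(pi[s2][a2] * (q[s2][a2] - tau * math.log(pi[s2][a2]))
                        for a2 in range(nA) if pi[s2][a2] > 0)
                q[s][a] = r[s][a] + gamma * v
        # policy improvement: in-sample soft greedy (Eq. 11)
        for s in range(nS):
            w = [pd * math.exp(q[s][a] / tau - math.log(pd)) if pd > 0 else 0.0
                 for a, pd in enumerate(pi_d[s])]
            z = sum(w)
            pi[s] = [x / z for x in w]
    return pi, q

r = [[0.0, 1.0], [0.0, 0.0]]          # hypothetical two-state MDP
P = [[0, 1], [1, 1]]
pi_d = [[0.5, 0.5], [1.0, 0.0]]       # action 1 unsupported in state 1
pi, q = soft_policy_iteration(r, P, pi_d, gamma=0.9, tau=0.1)
print(pi)
```

The returned policy concentrates on the higher-value action in state 0 while assigning exactly zero probability to the unsupported action in state 1.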

5. POLICY OPTIMIZATION USING THE IN-SAMPLE SOFTMAX

In this section, we develop an In-Sample Actor-Critic (AC) algorithm based on the in-sample softmax. Here we see the utility of the in-sample softmax: it facilitates sampling actions from $\pi_D$ using only actions in the dataset. This contrasts with other direct methods, like BCQ, that approximate the in-sample max by sampling from an approximate $\pi_\omega$ (Fujimoto et al., 2019). Throughout this section we develop the algorithm generically for continuous and discrete actions. Instead of using sums, therefore, we primarily write formulas using expectations, which allow for either discrete or continuous actions. The In-Sample AC algorithm is similar to SAC (Haarnoja et al., 2018), except that we carefully handle out-of-sample actions. We similarly learn an actor $\pi_\psi$ with parameters $\psi$, action-values $q_\theta$ with parameters $\theta$, and a value function $v_\phi$ with parameters $\phi$. Additionally, we learn $\pi_\omega \approx \pi_D$. We need this to define the greedy policy shown above in Equation (11), but do not directly use it to constrain the support over actions. The first step in the algorithm is to extract $\pi_\omega \approx \pi_D$. We do so using a simple maximum likelihood loss on the dataset: $L_{behavior}(\omega) = -\mathbb{E}_{(s,a) \sim D}[\log \pi_\omega(a|s)]$. We do not add any additional tricks to try to ensure action probabilities are zero where $\pi_D(a|s) = 0$, because $\pi_\omega$ plays only a small role in our update: it is used only to adjust the greedy policy, and is only queried on actions in the dataset. We then use a similar approach to SAC, alternating between estimating $q_\theta$ and $v_\phi$ for the current policy and improving the policy by minimizing a KL-divergence to the soft greedy policy. The main difference from SAC is that we update towards the in-sample soft greedy policy. We cannot directly use Equation (11), which involves $\pi_D$, but we can replace $\pi_D$ in the update with our approximation $\pi_\omega$.
We therefore update towards an approximate in-sample soft greedy policy
$$\hat{\pi}_{\pi_D, q_\theta}(a|s) = \pi_D(a|s) \exp\left(\frac{q_\theta(s,a) - Z(s)}{\tau} - \log \pi_\omega(a|s)\right), \quad (12)$$
where $Z(s) = \tau \log \int_a \pi_D(a|s) \exp\left(\frac{q_\theta(s,a)}{\tau} - \log \pi_\omega(a|s)\right) da$ is the normalizer that makes this a valid distribution. We minimize a forward KL to this in-sample soft greedy policy, because that allows us to sample the KL using only actions from the dataset. To see why, notice that
$$D_{KL}(\hat{\pi}_{\pi_D, q_\theta}(\cdot|s) \,\|\, \pi_\psi(\cdot|s)) = -\mathbb{E}_{a \sim \hat{\pi}_{\pi_D, q_\theta}(\cdot|s)}\left[\log \pi_\psi(a|s) - \log \hat{\pi}_{\pi_D, q_\theta}(a|s)\right] \quad (13)$$
$$= -\mathbb{E}_{a \sim \pi_D(\cdot|s)}\left[\exp\left(\frac{q_\theta(s,a) - Z(s)}{\tau} - \log \pi_\omega(a|s)\right)\left(\log \pi_\psi(a|s) - \log \hat{\pi}_{\pi_D, q_\theta}(a|s)\right)\right].$$
The expectation is now over samples $a \sim \pi_D(\cdot|s)$; the actions in the dataset are precisely sampled from $\pi_D$. To sample the gradient for this loss, we also need an estimate of $Z(s)$. We use our parameterized $v_\phi$ to estimate $Z$; we discuss why this is reasonable below. The final loss function for the actor $\pi_\psi$ is
$$L_{actor}(\psi) = -\mathbb{E}_{s,a \sim D}\left[\exp\left(\frac{q_\theta(s,a) - v_\phi(s)}{\tau} - \log \pi_\omega(a|s)\right)\log \pi_\psi(a|s)\right]. \quad (14)$$
For the value function we use standard value function updates for the entropy-regularized setting. The objectives are
$$L_{baseline}(\phi) = \mathbb{E}_{s \sim D, a \sim \pi_\psi(\cdot|s)}\left[\tfrac{1}{2}\left(v_\phi(s) - (q_\theta(s,a) - \tau \log \pi_\psi(a|s))\right)^2\right] \quad (15)$$
$$L_{critic}(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\left[\tfrac{1}{2}\left(r + \gamma v_\phi(s') - q_\theta(s,a)\right)^2\right]. \quad (16)$$
The action-values use the estimate $v_\phi$ in the next state, and so avoid using out-of-distribution actions. The update to the value function $v_\phi$ uses only actions sampled from $\pi_\psi$, which is being optimized to stay in-sample. Occasionally, however, $v_\phi$ may bootstrap off of out-of-distribution actions, because we do not guarantee that $\pi_\psi \preceq \pi_D$. In fact, in early learning we expect that $\pi_\psi$ will not satisfy this property. Despite this, the actor update will progressively reduce the probability of these out-of-distribution actions, even if the action-values temporarily overestimate their value, because the actor update pushes $\pi_\psi$ towards the in-sample greedy policy.
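The actor loss can be read as advantage-weighted log-likelihood with an extra $-\log \pi_\omega$ correction. A minimal pure-Python sketch of the per-sample weight (no neural networks or autodiff; the function names are ours, not from the released code):

```python
import math

def inac_actor_weight(q_sa, v_s, log_pi_omega, tau):
    """Weight exp((q(s,a) - v(s))/tau - log pi_omega(a|s)) multiplying
    log pi_psi(a|s) in the actor loss: high-advantage actions that are
    rare under the data distribution receive the largest weight."""
    return math.exp((q_sa - v_s) / tau - log_pi_omega)

def actor_loss(batch, tau):
    """Monte Carlo estimate of the actor loss over dataset tuples
    (q(s,a), v(s), log pi_omega(a|s), log pi_psi(a|s))."""
    return -sum(inac_actor_weight(q, v, lo, tau) * lp
                for q, v, lo, lp in batch) / len(batch)

# Same advantage, but the second action is rarer under pi_omega,
# so it receives a larger weight.
print(inac_actor_weight(1.0, 0.5, math.log(0.5), 1.0),
      inac_actor_weight(1.0, 0.5, math.log(0.1), 1.0))
```

This makes the contrast with KL-regularized methods explicit: without the $-\log \pi_\omega$ term, actions common in the data would be up-weighted regardless of their advantage.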
This means that the overestimate is unlikely to significantly skew the actor, and the overestimate should progressively be reduced as the support of $\pi_\psi$ shrinks. Finally, instead of learning a separate approximation for $Z$, we opt for the simpler approach of using $v_\phi$. The reason is that $v_\phi$ should provide a reasonable approximation to $Z$ because of the relationship between soft values and $Z$. From Equations (7) and (8) (formally proved in Lemma 3), we know that the soft values for the in-sample soft greedy policy $\pi_{\pi_D, q_\theta}$ correspond to the normalizer $Z$ for that policy. Therefore, given that $\pi_\omega \approx \pi_D$, the soft values of the approximate in-sample soft greedy policy $\hat{\pi}_{\pi_D, q_\theta}$ should also be similar to $Z$. Since we optimize our policy to approximate $\hat{\pi}_{\pi_D, q_\theta}$, we expect its entropy-regularized value, which is the learning target of $v_\phi$ as shown in Equation (15), to be a good approximation of $Z$.

6. EXPERIMENTS

In this section, we investigate three primary questions. First, in the tabular setting, can our algorithm InAC converge to the policy found by an oracle method that exactly eliminates out-of-distribution (OOD) actions when bootstrapping? Second, on MuJoCo benchmarks, how does our algorithm compare with several baselines across offline datasets with different coverage? Third, how does InAC compare with other baselines when used for online fine-tuning after offline training? We refer readers to Appendix B for additional details and supplementary experiments. Baseline algorithms: Oracle-Max completely eliminates OOD actions when bootstrapping in tabular domains, using counts to exactly estimate $\pi_D$. FQI applies the regular Q-learning update to the batch of offline data. CQL (Kumar et al., 2020) is conservative Q-learning. IQL (Kostrikov et al., 2022) is implicit Q-learning. TD3+BC (Fujimoto & Gu, 2021) is TD3 with a behavior cloning regularizer. AWAC (Nair et al., 2021) is Advantage Weighted Actor-Critic.

6.1. SANITY CHECK: APPROACHING ORACLE PERFORMANCE IN THE TABULAR SETTING

In this experiment we demonstrate that InAC finds the same policy as an oracle algorithm that completely removes out-of-distribution (OOD) actions. We use the Four Rooms environment, where the agent starts from the bottom-left corner and needs to navigate through the four rooms to reach the goal in the top-right corner in as few steps as possible. There are four actions: A = {up, down, right, left}. The reward is zero on each time step until the agent reaches the goal state, where it receives +1. Episodes terminate after 100 steps, and γ is 0.9. We use three different behavior policies to collect three datasets from this environment, called Expert, Random, and Missing-Action. The Expert dataset contains data collected by the optimal policy. In the Random dataset, the behavior policy takes each action with equal probability. For the Missing-Action dataset, we removed all transitions taking the down action in the upper-left room from the Mixed dataset. To magnify the impact of bootstrapping from OOD actions, we used optimistic initialization for each algorithm (i.e., we initialized all action-values to be larger than the actual values under the optimal policy). This ensures overestimation occurs in some states, so we can observe how well the algorithms mitigate poor bootstrap targets.
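The effect of optimistic initialization combined with an out-of-sample action is easy to reproduce in miniature; a toy sketch (a made-up two-state chain, not the Four Rooms domain) comparing the standard max bootstrap with an in-sample-restricted one:

```python
def q_update(q, s, a, r, s2, gamma, lr, support=None):
    """One tabular backup. With support=None this is the standard max over
    all actions; passing per-state in-sample action sets restricts the
    bootstrap to actions that appear in the data."""
    acts = range(len(q[s2])) if support is None else support[s2]
    target = r + gamma * max(q[s2][a2] for a2 in acts)
    q[s][a] += lr * (target - q[s][a])

gamma, lr = 0.9, 0.5
data = [(0, 0, 1.0, 1), (1, 0, 0.0, 1)]  # action 1 in state 1 never appears
support = [{0}, {0}]
q_max = [[10.0, 10.0], [10.0, 10.0]]     # optimistic initialization
q_in = [[10.0, 10.0], [10.0, 10.0]]
for _ in range(200):
    for s, a, r, s2 in data:
        q_update(q_max, s, a, r, s2, gamma, lr)
        q_update(q_in, s, a, r, s2, gamma, lr, support)
print(q_max[1][0], q_in[1][0])
# The standard max keeps bootstrapping off the never-updated optimistic
# value q[1][1] = 10, while the in-sample version decays toward the true
# value of 0.
```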

6.2. OOD EFFECTS IN CONTINUOUS ACTION PROBLEMS

In this section we provide a suite of results on four MuJoCo environments from D4RL (Fu et al., 2020), now standard datasets for evaluating offline RL algorithms. Each dataset (named Expert, Medium-Expert, Medium-Replay, and Medium) was designed to mimic a different deployment scenario. In the Expert dataset, all trajectories were collected using a policy learned by a SAC agent. In Medium, all trajectories were collected with the policy learned by a SAC agent halfway through training. Medium-Expert combines the expert and medium datasets, and similarly Medium-Replay combines Medium with the replay buffer used during learning. Figure 2 summarizes each algorithm's performance averaged over all environments under different datasets. Our algorithm's performance dominates the others across datasets. In Figure 3 we provide a more detailed view of the data, with learning curves in each environment. Overall, InAC performs best or nearly best across all domains. In Hopper M-Expert, the result is likely a three-way tie, while in HalfCheetah M-Expert, TD3+BC learns faster initially but then converges to much lower final performance than InAC. Naturally, all methods depend on the quality of the dataset. For example, when shifting from the higher-quality (Medium-Expert) to the lower-quality (Medium-Replay) data, TD3+BC, which regularizes the policy to stay close to the behavior policy, exhibits a significant performance drop. Overall, TD3+BC and CQL's performance is problem dependent: in some problems they perform well and in others they basically fail to learn. Finally, IQL performs nearly as well as InAC on many problems, but notably not on Walker2d and Hopper M-Replay.


These results provide evidence that explicitly avoiding bootstrapping from OOD actions provides a significant benefit, but that regularizing the learned policy to stay close to the behavior policy can be problematic.

6.3. FROM OFFLINE TRAINING TO ONLINE FINE-TUNING

In real-world applications, it can be useful to take an offline-trained deep RL agent and fine-tune it online. In this section, we investigate how the performance of the different baselines changes with fine-tuning. At the beginning of fine-tuning, the agent's policy is initialized with the policy learned offline, and the buffer is filled with that same offline dataset. During online interaction, the agent continually adds its new experience to the buffer. Figure 4 shows the policy performance before and after online fine-tuning. We see that InAC is consistently one of the best algorithms across these environments and datasets. There are a few particularly notable outcomes in these experiments. In Hopper and Walker2d with the Medium-Expert data, the performance of IQL drops significantly after fine-tuning. This contrasts with all the other algorithms, which maintained or improved performance when fine-tuning. The cause of this drop is as yet unclear. There is one new algorithm in this set, AWAC, which was originally proposed specifically for the online fine-tuning setting (Nair et al., 2021). We do in fact see that this algorithm can have quite poor offline performance but improve significantly after fine-tuning. Despite being designed for the fine-tuning setting, however, it does not outperform the offline algorithms, except in Hopper with Medium-Replay and, to a lesser extent, Walker2d with Medium-Replay. Overall, we find that InAC performs well both in the fully offline setting and when incorporating fine-tuning.

7. CONCLUSION

In this paper we considered the problem of learning action-values and corresponding policies from a fixed batch of data. Algorithms designed for this setting need to account for the fact that action coverage may be partial: certain actions may never be taken in certain regions of the state space. This complicates learning with algorithms that rely on action-value estimates q(s, a) and bootstrapping. In particular, if an action a is not visited in s or similar states, then q(s, a) can be an arbitrary value. If this arbitrary value is high, it is likely to be selected by the max in the bootstrap target and used to update the policy, which increases the probability of seemingly high-value actions. The agent is chasing hallucinations, which can produce poor policies or even divergence. We focused on a simple approach to mitigate this issue: redefining the objectives to use an in-sample softmax, and finally obtaining an approach that updates towards an in-sample soft greedy policy using only actions sampled from the dataset. The resulting In-Sample AC algorithm avoids these hallucinated values when updating the actor, and so correspondingly avoids them when updating the values. We had two clear findings in this work. First, the move to an in-sample softmax was a key step towards a simple implementation of in-sample learning. Previous work, like BCQ, tried to produce a simple algorithm built on an in-sample max, but needed to incorporate several tricks, and later algorithms significantly improve on it. In-Sample AC, on the other hand, required only minor modifications to existing AC approaches. The actor update was modified to consider the in-sample softmax, but the resulting update is no more complex than typical actor updates. Second, our results indicate that overall Implicit Q-learning (IQL) is quite a good algorithm. Like In-Sample AC, it also avoids relying on actions sampled from an approximation of $\pi_D$, but does so using expectile regression.
Nonetheless, we find that In-Sample AC is always competitive with IQL, and in some cases significantly outperforms it when the dataset is generated by a more suboptimal behavior policy. IQL can still be skewed by too many suboptimal actions in the dataset. In-Sample AC provides a simple, easy-to-use approach for both discrete and continuous actions, with an update designed to match only the support of $\pi_D$, not its action probabilities.

A APPENDIX: PROOFS

This section includes the proofs of all main results.

A.1 RESULTS FOR ONE-STEP DECISION MAKING

We first introduce some results for one-step decision making that will be used in the derivations of the main results.

Maximum Entropy Optimization

We consider a $k$-armed one-step decision making problem. Let $\Delta$ be the $k$-dimensional simplex and $q = (q(1), \ldots, q(k)) \in \mathbb{R}^k$ be the reward vector. Maximum entropy optimization considers
$$\max_{\pi \in \Delta} \pi \cdot q + \tau H(\pi).$$
The next result characterizes the solution of this problem (Lemma 4 of Nachum et al. (2017)).

Lemma 2. For $\tau > 0$, let
$$F_\tau(q) = \tau \log \sum_a e^{q(a)/\tau}, \qquad f_\tau(q) = \frac{e^{q/\tau}}{\sum_a e^{q(a)/\tau}} = e^{\frac{q - F_\tau(q)}{\tau}}.$$
Then
$$F_\tau(q) = \max_{\pi \in \Delta} \pi \cdot q + \tau H(\pi) = f_\tau(q) \cdot q + \tau H(f_\tau(q)).$$

In-Sample Maximum Entropy Optimization. Let $\beta \in \Delta$ be an arbitrary policy. In-sample maximum entropy optimization considers
$$\max_{\pi \preceq \beta} \pi \cdot q + \tau H(\pi).$$
We now characterize the solution of this problem. For $\tau > 0$ define the in-sample softmax value,
$$F_{\beta,\tau}(q) = \tau \log \sum_{a: \beta(a) > 0} e^{q(a)/\tau},$$
and the in-sample softmax policy,
$$f_{\beta,\tau}(q) = \frac{\beta\, e^{q/\tau - \log \beta}}{\sum_{a: \beta(a) > 0} e^{q(a)/\tau}} = \beta\, e^{\frac{q - F_{\beta,\tau}(q)}{\tau} - \log \beta}.$$

Lemma 3.
$$F_{\beta,\tau}(q) = \max_{\pi \preceq \beta} \pi \cdot q + \tau H(\pi) = f_{\beta,\tau}(q) \cdot q + \tau H(f_{\beta,\tau}(q)).$$
Proof. This result is directly implied by Lemma 2.

The next result shows that $F_{\beta,\tau}$ is a non-expansion.

Lemma 4. For any two vectors $q_1, q_2 \in \mathbb{R}^k$, $|F_{\beta,\tau}(q_1) - F_{\beta,\tau}(q_2)| \le \|q_1 - q_2\|_\infty$.

Proof.
$$F_{\beta,\tau}(q_1) - F_{\beta,\tau}(q_2) = \sup_{\pi_1 \preceq \beta} \{\pi_1 \cdot q_1 + \tau H(\pi_1)\} - \sup_{\pi_2 \preceq \beta} \{\pi_2 \cdot q_2 + \tau H(\pi_2)\}$$
$$= \sup_{\pi_1 \preceq \beta} \inf_{\pi_2 \preceq \beta} \left\{\pi_1 \cdot q_1 - \pi_2 \cdot q_2 + \tau H(\pi_1) - \tau H(\pi_2)\right\}$$
$$\le \sup_{\pi \preceq \beta} \{\pi \cdot q_1 - \pi \cdot q_2\}$$
$$\le \max_{a: \beta(a) > 0} |q_1(a) - q_2(a)|$$
$$\le \max_a |q_1(a) - q_2(a)|,$$
where the first step follows by Lemma 3 and the third step follows by choosing $\pi_2 = \pi_1$. The same argument with $q_1$ and $q_2$ exchanged gives the absolute-value bound. This finishes the proof.
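Lemma 2 can be checked numerically; a short sketch (the reward vector is arbitrary):

```python
import math

def F(q, tau):
    """Soft maximum F_tau(q) = tau * log sum_a exp(q(a)/tau)."""
    return tau * math.log(sum(math.exp(x / tau) for x in q))

def f(q, tau):
    """Softmax policy f_tau(q) = exp((q - F_tau(q))/tau)."""
    Fq = F(q, tau)
    return [math.exp((x - Fq) / tau) for x in q]

def entropy_objective(q, pi, tau):
    """The maximum-entropy objective pi . q + tau * H(pi)."""
    return sum(p * x for p, x in zip(pi, q)) - tau * sum(
        p * math.log(p) for p in pi if p > 0)

q, tau = [0.3, 1.2, -0.5], 0.7
pi = f(q, tau)
print(F(q, tau), entropy_objective(q, pi, tau))
# The two quantities coincide, as stated in Lemma 2.
```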

A.2 RESULT FOR ON-POLICY ENTROPY-REGULARIZED BACKUP

In this section we show some basic results for the on-policy entropy-regularized backup. We note that most results generalize Section C.2 of Nachum et al. (2017) (which states the results for $\tilde v$) to $q$. Recall that the entropy-regularized value functions are defined as
$$\tilde q_\pi(s, a) = r(s, a) + \gamma \mathbb{E}_{s'}[\tilde v_\pi(s')], \qquad \tilde v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) - \tau \log \pi(a_t | s_t)\big) \,\middle|\, s_0 = s\right].$$
Define the on-policy entropy-regularized Bellman operator
$$(T^\pi q)(s, a) = r(s, a) + \gamma \mathbb{E}_{s', a' \sim P^\pi(\cdot | s, a)}\big[q(s', a') - \tau \log \pi(a' | s')\big].$$

Lemma 5. For any policy $\pi$, $\tilde q_\pi$ satisfies $\tilde q_\pi = T^\pi \tilde q_\pi$. Moreover, if $|\mathcal{A}| < \infty$, $T^\pi$ is a contraction mapping.

Proof. By the definition of $\tilde v_\pi$ and $\tilde q_\pi$,
$$\begin{aligned}
\tilde v_\pi(s) &= \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) - \tau \log \pi(a_t|s_t)\big) \,\middle|\, s_0 = s\right] &(32)\\
&= \mathbb{E}_\pi\!\left[r(s_0, a_0) - \tau \log \pi(a_0|s_0) + \gamma \sum_{i=0}^{\infty} \gamma^i \big(r(s_{i+1}, a_{i+1}) - \tau \log \pi(a_{i+1}|s_{i+1})\big) \,\middle|\, s_0 = s\right] &(33)\\
&= \mathbb{E}_{a \sim \pi(s)}\!\left[r(s, a) - \tau \log \pi(a|s) + \gamma \mathbb{E}_{s'}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) - \tau \log \pi(a_t|s_t)\big) \,\middle|\, s_0 = s'\right]\right] &(34)\\
&= \mathbb{E}_{a \sim \pi(s)}\!\left[r(s, a) - \tau \log \pi(a|s) + \gamma \mathbb{E}_{s' \sim P(s,a)}[\tilde v_\pi(s')]\right] &(35)\\
&= \mathbb{E}_{a \sim \pi(s)}[\tilde q_\pi(s, a) - \tau \log \pi(a|s)] . &(36)
\end{aligned}$$
Thus
$$\begin{aligned}
\tilde q_\pi(s, a) &= r(s, a) + \gamma \mathbb{E}_{s'}[\tilde v_\pi(s')] &(37)\\
&= r(s, a) + \gamma \mathbb{E}_{s'}\big[\mathbb{E}_{a' \sim \pi(s')}[\tilde q_\pi(s', a') - \tau \log \pi(a'|s')]\big] &(38)\\
&= r(s, a) + \gamma \mathbb{E}_{s', a' \sim P^\pi(s,a)}[\tilde q_\pi(s', a') - \tau \log \pi(a'|s')] &(39)\\
&= (T^\pi \tilde q_\pi)(s, a) . &(40)
\end{aligned}$$
This finishes the proof of the first part. Since $|\mathcal{A}| < \infty$, the entropy term $-\mathbb{E}_{a \sim \pi(s)}[\tau \log \pi(a|s)]$ is bounded for any $s$, and that $T^\pi$ is a contraction mapping then follows directly from the standard argument (Puterman, 2014).
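Lemma 5 can be illustrated numerically: iterating $T^\pi$ from zero converges to a fixed point satisfying the soft Bellman equation. The sketch below uses a small random MDP of our own construction, not the paper's environments.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, tau = 4, 3, 0.9, 0.1

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over s'
r = rng.uniform(size=(nS, nA))                  # reward table r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # an arbitrary stochastic policy

def T_pi(q):
    # (T^pi q)(s,a) = r(s,a) + gamma * E_{s'}[ sum_{a'} pi(a'|s')(q(s',a') - tau log pi(a'|s')) ]
    v = np.sum(pi * (q - tau * np.log(pi)), axis=1)  # soft state value under pi
    return r + gamma * P @ v

q = np.zeros((nS, nA))
for _ in range(500):        # gamma^500 is negligible, so this is effectively converged
    q = T_pi(q)

# Lemma 5: the iterate is a fixed point of T^pi (up to numerical tolerance)
assert np.allclose(q, T_pi(q), atol=1e-8)
```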

Published as a conference paper at ICLR 2023

This shows that $\tilde q_\pi$ is a fixed point of $T^\pi$. That is, starting from any value $q$, we can learn $\tilde q_\pi$ by repeatedly applying $q \leftarrow T^\pi q$. The next result characterizes the convergence rate of this algorithm.

Lemma 6. For any $\pi$ and $q$, we have
$$\|(T^\pi)^k q - \tilde q_\pi\|_\infty \le \gamma^k \|q - \tilde q_\pi\|_\infty .$$

Proof. We prove the result by induction. For the base case, $k = 0$, the result trivially follows. Now suppose the result holds for $k - 1$. Then
$$\begin{aligned}
\|(T^\pi)^k q - \tilde q_\pi\|_\infty &= \max_{s,a} \big|(T^\pi)^k q(s, a) - \tilde q_\pi(s, a)\big| &(42)\\
&= \max_{s,a} \big|T^\pi (T^\pi)^{k-1} q(s, a) - T^\pi \tilde q_\pi(s, a)\big| &(43)\\
&= \gamma \max_{s,a} \big|\mathbb{E}_{s', a' \sim P^\pi(s,a)}\big[(T^\pi)^{k-1} q(s', a') - \tilde q_\pi(s', a')\big]\big| &(44)\\
&\le \gamma \max_{s,a} \big|(T^\pi)^{k-1} q(s, a) - \tilde q_\pi(s, a)\big| &(45)\\
&\le \gamma^k \|q - \tilde q_\pi\|_\infty ,
\end{aligned}$$
where the second step uses Lemma 5, the third step uses the definition of $T^\pi$, the fourth step uses Hölder's inequality, and the last step uses the induction hypothesis. This finishes the proof.

Finally, we also need the monotonicity property of the on-policy Bellman operator.

Lemma 7. For any $\pi$, if $q_1 \ge q_2$, then $T^\pi q_1 \ge T^\pi q_2$.

Proof. Assume $q_1 \ge q_2$ and note that for any state-action pair $(s, a)$,
$$(T^\pi q_1)(s, a) - (T^\pi q_2)(s, a) = \gamma \mathbb{E}_{s', a' \sim P^\pi(s,a)}[q_1(s', a') - q_2(s', a')] \ge 0 .$$

We conclude with a policy improvement lemma.

Lemma 8. Let $\pi$ be a policy such that $\pi \ll \beta$. Define
$$\pi'(\cdot|s) \propto \beta(\cdot|s) \exp\!\left(\frac{\tilde q_\pi(s, \cdot)}{\tau} - \log \beta(\cdot|s)\right).$$
Then $\pi' \ll \beta$ and $\tilde q_{\pi'} \ge \tilde q_\pi$.

Proof. The first part holds trivially by the definition of $\pi'$. For the second part, note that by Lemma 3, for any state $s \in \mathcal{S}$,
$$\pi'(\cdot|s) \cdot \big(\tilde q_\pi(s, \cdot) - \tau \log \pi'(\cdot|s)\big) \ge \pi(\cdot|s) \cdot \big(\tilde q_\pi(s, \cdot) - \tau \log \pi(\cdot|s)\big) . \tag{49}$$
Then by Lemma 5, for any $(s, a) \in \mathcal{S} \times \mathcal{A}$,
$$\begin{aligned}
\tilde q_\pi(s, a) &= r(s, a) + \gamma \mathbb{E}_{s', a' \sim P^\pi(\cdot|s,a)}[\tilde q_\pi(s', a') - \tau \log \pi(a'|s')] &(50)\\
&\le r(s, a) + \gamma \mathbb{E}_{s', a' \sim P^{\pi'}(\cdot|s,a)}[\tilde q_\pi(s', a') - \tau \log \pi'(a'|s')] &(51)\\
&\le \cdots \\
&\le \tilde q_{\pi'}(s, a) ,
\end{aligned}$$
where we recursively apply Lemma 5 to expand the definition of $\tilde q_\pi$ and apply Eq. (49).

A.3 RESULT FOR OFF-POLICY ENTROPY-REGULARIZED BACKUP

Given an arbitrary policy $\beta$, consider the problem $\max_{\pi \ll \beta} \tilde v_\pi$. For $\tau > 0$, define the in-sample softmax Bellman operator
$$(T^*_\beta q)(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[\tau \log \sum_{a' : \beta(a'|s') > 0} \exp\big(q(s', a')/\tau\big)\right] \tag{55}$$
$$= r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[F_{\beta,\tau}(q(s', :))\big] . \tag{56}$$

Lemma 9. For $\gamma < 1$, the fixed point of the in-sample softmax Bellman operator, $q^*_\beta = T^*_\beta q^*_\beta$, exists and is unique.

Proof. We first show that $T^*_\beta$ is a contraction. Let $q_1$ and $q_2$ be two value functions. Then
$$\begin{aligned}
\|T^*_\beta q_1 - T^*_\beta q_2\|_\infty &= \max_{s,a} \big|T^*_\beta q_1(s, a) - T^*_\beta q_2(s, a)\big| &(57)\\
&= \gamma \max_{s,a} \big|\mathbb{E}_{s' \sim P(\cdot|s,a)}\big[F_{\beta,\tau}(q_1(s', :)) - F_{\beta,\tau}(q_2(s', :))\big]\big| &(58)\\
&\le \gamma \max_{s} \big|F_{\beta,\tau}(q_1(s, :)) - F_{\beta,\tau}(q_2(s, :))\big| &(59)\\
&\le \gamma \max_{s,a} |q_1(s, a) - q_2(s, a)| &(60)\\
&= \gamma \|q_1 - q_2\|_\infty ,
\end{aligned}$$
where the second step uses the definition of $T^*_\beta$, the third step uses Hölder's inequality, and the fourth step uses Lemma 4. Existence and uniqueness of the fixed point then follow from the Banach fixed-point theorem.
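The contraction property in Lemma 9 is easy to check numerically. The sketch below (a random MDP of our own construction, with one action removed from the support of β in every state) verifies that one application of the in-sample softmax Bellman operator shrinks the sup-norm distance between two arbitrary value functions by at least a factor of γ.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, tau = 5, 4, 0.9, 0.2

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a] over s'
r = rng.uniform(size=(nS, nA))
beta = rng.dirichlet(np.ones(nA), size=nS)
beta[:, -1] = 0.0                               # make the last action out-of-sample everywhere
beta /= beta.sum(axis=1, keepdims=True)

def T_star_beta(q):
    # (T*_beta q)(s,a) = r + gamma * E_{s'}[ tau * log sum_{a': beta(a'|s')>0} exp(q(s',a')/tau) ]
    masked = np.where(beta > 0, np.exp(q / tau), 0.0)
    F = tau * np.log(masked.sum(axis=1))        # in-sample softmax value per state
    return r + gamma * P @ F

q1, q2 = rng.normal(size=(nS, nA)), rng.normal(size=(nS, nA))
lhs = np.max(np.abs(T_star_beta(q1) - T_star_beta(q2)))
# Lemma 9: T*_beta is a gamma-contraction in the sup norm
assert lhs <= gamma * np.max(np.abs(q1 - q2)) + 1e-12
```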

Note that by definition, $q^*_\beta$ is the fixed point of $T^*_\beta$.

Lemma 10. If $q$ is bounded and $q \ge T^*_\beta q$, then for any $\pi \ll \beta$, $q \ge \tilde q_\pi$.

Proof. We first prove that $q \ge T^*_\beta q$ implies $q \ge (T^\pi)^k q$ for all $k \ge 0$; the result then follows by letting $k \to \infty$ and applying Lemma 6. According to the assumption,
$$\begin{aligned}
q(s, a) \ge (T^*_\beta q)(s, a) &= r(s, a) + \gamma \mathbb{E}_{s'}\big[F_{\beta,\tau}(q(s', :))\big] &(62)\\
&\ge r(s, a) + \gamma \mathbb{E}_{s'}\!\left[\sum_{a'} \pi(a'|s') \big(q(s', a') - \tau \log \pi(a'|s')\big)\right] = (T^\pi q)(s, a) , &(64)
\end{aligned}$$
where the second inequality follows by Lemma 3. Then by Lemma 7,
$$q \ge T^\pi q \ge (T^\pi)^2 q \ge \cdots \ge (T^\pi)^k q . \tag{65}$$
This finishes the proof.

We have the following key result.

Lemma 11. For any $s \in \mathcal{S}$, $\tilde v^*_\beta(s) = \max_{\pi \ll \beta} \tilde v_\pi(s)$.

Proof. We first show $\tilde v^*_\beta \ge \max_{\pi \ll \beta} \tilde v_\pi$. Using the definitions, for any $\pi \ll \beta$,
$$\begin{aligned}
\tilde v^*_\beta &= F_{\beta,\tau}(q^*_\beta) &(66)\\
&= \pi^*_\beta \cdot (q^*_\beta - \tau \log \pi^*_\beta) &(67)\\
&\ge \pi \cdot (q^*_\beta - \tau \log \pi) &(68)\\
&\ge \pi \cdot (\tilde q_\pi - \tau \log \pi) &(69)\\
&= \tilde v_\pi ,
\end{aligned}$$
where the second and third steps follow by Lemma 3, the fourth step follows by Lemma 10, and the last step follows by the definition of $\tilde v_\pi$.

We then prove $\max_{\pi \ll \beta} \tilde v_\pi \ge \tilde v^*_\beta$ by first showing that $q^*_\beta = \tilde q_{\pi^*_\beta}$. Since $q^*_\beta$ is the fixed point of $T^*_\beta$, by the uniqueness of the fixed point (Lemma 9), we only need to show that $T^*_\beta \tilde q_{\pi^*_\beta} = \tilde q_{\pi^*_\beta}$. This holds because for any $(s, a)$,
$$\begin{aligned}
(T^*_\beta \tilde q_{\pi^*_\beta})(s, a) &= r(s, a) + \gamma \mathbb{E}_{s'}\big[F_{\beta,\tau}(\tilde q_{\pi^*_\beta}(s', :))\big] &(71)\\
&= r(s, a) + \gamma \mathbb{E}_{s'}\!\left[\sum_{a'} \pi^*_\beta(a'|s') \big(\tilde q_{\pi^*_\beta}(s', a') - \tau \log \pi^*_\beta(a'|s')\big)\right] &(72)\\
&= (T^{\pi^*_\beta} \tilde q_{\pi^*_\beta})(s, a) &(73)\\
&= \tilde q_{\pi^*_\beta}(s, a) ,
\end{aligned}$$
where the second step uses Lemma 3 and the last step uses Lemma 5. Then
$$\max_{\pi \ll \beta} \tilde v_\pi \ge \tilde v_{\pi^*_\beta} = \pi^*_\beta \cdot (\tilde q_{\pi^*_\beta} - \tau \log \pi^*_\beta) = \pi^*_\beta \cdot (q^*_\beta - \tau \log \pi^*_\beta) = \tilde v^*_\beta ,$$
where the second equality uses $q^*_\beta = \tilde q_{\pi^*_\beta}$ and the last step uses Lemma 3. This finishes the proof.

A.4 PROOF OF THEOREM 1

Theorem 4 (Restatement of Theorem 1). Let $\tilde q^*_\beta$ be the value function recursively defined as
$$\tilde q^*_\beta(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[\tau \log \sum_{a' : \beta(a'|s') > 0} \exp\big(\tilde q^*_\beta(s', a')/\tau\big)\right],$$
and $\tilde\pi^*_\beta$ the policy defined as
$$\tilde\pi^*_\beta(a|s) \propto \beta(a|s) \exp\!\left(\frac{\tilde q^*_\beta(s, a)}{\tau} - \log \beta(a|s)\right). \tag{77}$$
Then $\tilde q^*_\beta \to q^*_\beta$ and $\tilde\pi^*_\beta \to \pi^*_\beta$ as $\tau \to 0$, where $q^*_\beta$ and $\pi^*_\beta$ are the in-sample optimal value function and policy.

Proof. By Lemma 11, we have $\tilde q^*_\beta(s, a) = \max_{\pi \ll \beta} \tilde q_\pi(s, a)$ for any $(s, a)$. As $\tau \to 0$ the entropy term vanishes and $\tilde q_\pi \to q_\pi$, so the result directly follows by the definition of $q^*_\beta$. The result for the policy can be proved similarly.
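The limiting behavior in Theorem 4 is the standard fact that log-sum-exp approaches the maximum as the temperature goes to 0, restricted here to in-sample actions. A small numeric sketch (the bandit values are ours, purely for illustration):

```python
import numpy as np

q = np.array([1.0, 3.0, 10.0])
beta = np.array([0.6, 0.4, 0.0])   # the globally maximal action is out-of-sample

def F_beta_tau(q, beta, tau):
    sup = beta > 0
    # numerically stable log-sum-exp restricted to the support of beta
    m = q[sup].max()
    return m + tau * np.log(np.sum(np.exp((q[sup] - m) / tau)))

in_sample_max = q[beta > 0].max()  # 3.0, not the global max 10.0
for tau in [1.0, 0.1, 0.01, 0.001]:
    gap = F_beta_tau(q, beta, tau) - in_sample_max
    # log-sum-exp over k in-sample actions overshoots the max by at most tau*log(k)
    assert 0.0 <= gap <= tau * np.log(2) + 1e-12

# As tau -> 0, the in-sample softmax value approaches the in-sample max
assert np.isclose(F_beta_tau(q, beta, 1e-4), 3.0, atol=1e-3)
```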

B APPENDIX FOR EXPERIMENTS

B.1 ADDITIONAL EXPERIMENTS

This section includes additional experiments to investigate the following questions.

1. We used optimistic initialization for all algorithms in the tabular domain. How do the algorithms perform when using a zero or pessimistic initialization? We show this in Figure 5. We also added a Mixed dataset, which has 1% optimal trajectories and 99% random trajectories.

2. How do our algorithms work on discrete-action domains in the deep learning setting? We show the learning curves on Mountain Car, Lunar Lander, and Acrobot in Figure 7. The final performance is listed in Figure 8 with a normalized score; the absolute scores can be found in Figure 9.

3. How do our algorithms work on more datasets in the continuous-action domains? We put the learning curves for the expert and medium datasets in Figure 10, then list the performance of policies learned with all baselines and all datasets in Figure 11 with a normalized score; the absolute scores can be found in Figure 12. More fine-tuning results are in Figure 13 with a normalized score, with absolute scores in Figure 14.

4. Will longer runs reduce the gap between InAC and the baselines? Can InAC still learn a better or similar policy compared to the baselines if we use another common batch-size setting in Mujoco tasks (256)? We changed the batch size to 256, increased the number of iterations to 1.2 million, and show the performance in Figure 15.

5. How does InAC perform in AntMaze? We used antmaze-umaze-v0 and antmaze-umaze-diverse-v0 to test InAC, and added learning curves comparing InAC to IQL in Figure 16. We followed the setup in previous work (Kostrikov et al., 2022).

B.2 REPRODUCING DETAILS

This section includes all experimental details needed to reproduce the empirical results in this paper. We use Python 3.9.6, Gym 0.10.0, and PyTorch 1.10.0.

B.3 REPRODUCING DETAILS ON TABULAR DOMAIN

Four-room environment: The environment is a 13 × 13 gridworld, with walls separating the space into 4 rooms (as shown in Figure 6); the black area is the wall. The agent starts from the lower-left corner and learns to stay in the upper-right corner. When the agent runs into a wall, it returns to its previous state. The agent receives a reward of +1 when it transitions into the upper-right corner state and 0 otherwise. The discount rate is 0.9, so the upper bound on the state value is 10. In the tabular experiments, each trajectory was limited to 100 steps, τ was set to 0.01, and mini-batch updates were used: the agent sampled 100 transitions at each iteration. Among the 4 datasets we used, the mixed and random datasets used random restarts to ensure full state-action coverage, while the expert and missing-action datasets did not.

Offline data collection: We used value iteration (10k iterations) to find the optimal policy. For the expert dataset, we collected 10k transitions with the optimal policy. For the random dataset, we collected 10k transitions with random restarts and equal probability of taking each action. The mixed dataset consists of 100 transitions from the expert dataset and 9900 transitions from the random dataset. The missing-action dataset is constructed by removing from the mixed dataset all transitions that take the down action in the upper-left room.
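For concreteness, the dynamics described above can be sketched as follows. The wall layout below is illustrative only, not the paper's exact map; the point is the wall-bounce rule, the +1 goal reward, and the stated value upper bound of 1/(1 − γ) = 10 for staying at the goal.

```python
import numpy as np

SIZE = 13
walls = np.zeros((SIZE, SIZE), dtype=bool)
walls[6, :] = walls[:, 6] = True   # illustrative cross of walls (not the paper's exact layout)
walls[6, 3] = walls[3, 6] = walls[6, 9] = walls[9, 6] = False  # doorways between rooms
GOAL = (0, SIZE - 1)               # upper-right corner (row 0 is the top row here)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def step(state, action):
    dr, dc = MOVES[action]
    nr, nc = state[0] + dr, state[1] + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE) or walls[nr, nc]:
        nr, nc = state             # running into a wall returns the agent to its previous state
    reward = 1.0 if (nr, nc) == GOAL else 0.0
    return (nr, nc), reward

# Staying in the goal forever earns sum_t gamma^t = 1/(1-gamma) = 10, the stated upper bound.
gamma = 0.9
assert abs(sum(gamma**t for t in range(10_000)) - 10.0) < 1e-6
```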

Algorithm parameter sweep:

The learning rates of InAC, Oracle-Max, and FQI were swept in [0.1, 0.03, 0.01, 0.003, 0.001]. Sarsa used a larger range: [0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001, 0.00003]. The τ of the in-sample methods was fixed at 0.01.

B.4 REPRODUCING DETAILS OF DEEP RL ALGORITHMS

Network architecture: In Mujoco tasks, we used 2 hidden layers with 256 nodes each for all neural networks. In discrete-action environments, we used 2 hidden layers with 64 nodes each.

Offline data generation details: In continuous control tasks, we used the datasets provided by D4RL. In discrete control tasks, we used a well-trained DQN agent to collect data. The DQN agent had 2 hidden layers with 64 nodes each, with the FTA (Pan et al., 2021) activation function on the last hidden layer and ReLU on the others. In Acrobot, the agent was trained for 40k steps with batch size 64. In Lunar Lander and Mountain Car, we trained the agent for 500k and 60k steps respectively, with other settings the same as in Acrobot. The expert dataset contains 50k transitions collected with the fixed policy learned by the DQN agent. The mixed dataset has 2k (4%) near-optimal transitions and 48k (96%) transitions collected with a randomly initialized policy.

Offline training details: In all tasks, we used mini-batch sampling with mini-batch size 100, the Adam optimizer, and the ReLU activation function. The target network is updated with a Polyak average: 0.995 × target weights + 0.005 × learned weights. We trained the agent for 0.8 million iterations in Mujoco and 70k iterations in discrete-action environments.

Fine-tuning details: We kept all settings the same as in offline learning and used the learned policy as initialization. The offline data was placed in the buffer at the beginning of fine-tuning; new interactions were appended to the buffer as fine-tuning proceeded, and no data was removed. Fine-tuning ran for 0.8M steps.
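The Polyak target-network update described above can be sketched as follows (a minimal NumPy stand-in for the PyTorch update; after many updates with fixed online weights, the target approaches the online weights at rate 0.995^k):

```python
import numpy as np

def polyak_update(target, online, rho=0.995):
    """Target update used above: new_target = rho * target + (1 - rho) * online."""
    return [rho * t + (1.0 - rho) * w for t, w in zip(target, online)]

target = [np.zeros(3)]   # toy "weights": one parameter tensor
online = [np.ones(3)]
for _ in range(1000):
    target = polyak_update(target, online)

# Closed form: after k updates from zero, target = 1 - rho^k (residual 0.995^1000 ≈ 0.0067)
assert np.allclose(target[0], 1.0 - 0.995**1000)
```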
Policy evaluation details: The policy was evaluated for 5 episodes in the true environment with a timeout: Acrobot and Lunar Lander used timeout = 500, Mountain Car used 2000, and Mujoco tasks used 1000. The reported numbers were averaged over 10 random seeds.

In each discrete-action environment, we tested 2 datasets: Expert and Mixed. In the Expert dataset, all trajectories were collected with the near-optimal policy, while in the Mixed dataset, 4% of trajectories were optimal and 96% were collected with a randomly initialized neural network policy. The y-axis of the learning curves is a normalized return: returns were normalized according to the returns obtained by a well-trained DQN agent (upper bound) and a randomly initialized network (lower bound), so a higher normalized value means better performance. The curves were smoothed with a window of length 10 and averaged over 10 random seeds, with shaded areas indicating 95% confidence intervals.

Algorithm parameter settings for Mujoco tasks: For all algorithms, the learning rate was swept in {3e-4, 1e-4, 3e-5}. InAC swept τ in {1.0, 0.5, 0.33, 0.1, 0.01}. AWAC swept λ in {1.0, 0.5, 0.33, 0.1, 0.01}. IQL swept the expectile in {0.9, 0.7} and the temperature in {10.0, 3.0}; these numbers come from what was reported in the original IQL paper. TD3+BC used α = 2.5 as in the original paper. CQL-SAC used automatic entropy tuning as in the original paper.
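The score normalization described above is a simple affine rescaling. The sketch below uses hypothetical returns purely for illustration; the random-network and DQN reference scores are not the paper's numbers.

```python
def normalized_score(raw, random_score, dqn_score):
    """Normalization used above: 0 = randomly initialized network, 1 = well-trained DQN."""
    return (raw - random_score) / (dqn_score - random_score)

# Hypothetical Lunar-Lander-style numbers, for illustration only.
assert normalized_score(raw=100.0, random_score=-200.0, dqn_score=200.0) == 0.75
```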

Discrete action environments:

For all algorithms, the learning rate was swept in {0.003, 0.001, 0.0003, 0.0001, 3e-5, 1e-5}. For InAC, τ was swept in {1.0, 0.5, 0.1, 0.05, 0.01}. IQL used the same parameter sweep ranges as in the Mujoco tasks. AWAC used λ = 1.0 as in the original paper. CQL used α = 5.0.



We use the standard notation ∆(X) to denote the set of probability distributions over a finite set X.

Fujimoto & Gu (2021) use a behavior cloning regularization (π(s) − a)², where a is the action in the dataset. We note it is exactly a KL regularization under a Gaussian parameterization with a fixed standard deviation.

Brandfonbrener et al. (2021) propose a one-step policy improvement method: first learn the value of π D, then directly train a policy to maximize the learned value. This is thus indeed a behavior-regularized approach.

Note that this softmax operator for the soft Bellman optimality equation is different from the softmax Bellman operator, which uses an expectation in the bootstrap over a softmax policy and which is known to have the issue of not being a contraction (Asadi & Littman, 2017). The log-sum-exp formula is a standard way to approximate the max, and is naturally called a softmax.

We define 0 · ∞ = 0.



Figure 1: Policy evaluation performance (return per episode) vs. number of updates on the Expert, Random, and Missing-Action datasets. Each curve is averaged over 10 runs, and shaded areas show a 95% confidence interval. The results in Figure 1 are exactly as expected. InAC converges to the same policy found by Oracle-Max. The FQI baseline cannot effectively remove OOD actions when bootstrapping, and so performs poorly and sometimes completely fails when the dataset has poor action coverage (i.e., there are many OOD actions). Finally, IQL performs poorly when the offline data is highly skewed towards suboptimal policies, likely because the upper expectile of the state value then provides a poor approximation to the in-sample maximum action value.

Figure 2: Average score over environments vs. different offline datasets. We averaged the normalized score over four Mujoco tasks and 10 runs. The shaded area indicates the 95% confidence interval. Comparing InAC and IQL with the sign test over 40 runs, InAC was significantly better on all datasets: Expert, M-Expert, and M-Replay had p-values close to 0, while the Medium dataset gave p = 0.002.

Figure 4: Online fine-tuned performance on the Medium-Expert and Medium-Replay datasets across four Mujoco environments; M stands for Medium in this figure. The results were averaged over 10 random seeds. The short vertical line indicates the range of 3 standard errors. Each colored bar shows the performance after 0.8M steps of fine-tuning; the thinner black bar inside the colored bar indicates the performance immediately after offline training (i.e., before online fine-tuning). We also report numerical results in Table 13 in Appendix B.1.


Figure 5: Learning curves under different initializations on our four-room gridworld tabular domain: (a) used 10 to initialize weights, (b) used 0, and (c) used -20. InAC learns a reasonable policy with 1) expert trajectories, 2) missing-action trajectories, 3) mixed trajectories, and 4) random trajectories, where 3) and 4) have full state-action coverage. The results were averaged over 10 random seeds, except that CQL had 5 seeds. The shaded area indicates the 95% confidence interval.

Figure 6: Visualization of the learned policy of each algorithm and the estimated values at each state. The blue color indicates the action value under the corresponding policy, and the arrow indicates the action taken by the learned policy; a deeper color corresponds to a higher action value. It is clear that both FQI and SARSA suffer serious overestimation and find an incorrect policy when the offline data lacks action coverage (i.e., on the Expert and Missing-Action offline data).

Figure 10: Learning curves on the Mujoco tasks. InAC learns the best policy or a policy comparable to the strongest baseline. The results were averaged over 10 random seeds, except that CQL had 5 seeds. The shaded area indicates the 95% confidence interval.

Figure 13: The performance changes during fine-tuning. The number in brackets is the standard error. Scores are normalized. Performance was averaged over 10 random seeds.

Figure 16: Offline learning curves with 1 million iterations. The x-axis is the number of iterations and the y-axis is the normalized score. Performance was averaged over 5 random seeds, after smoothing with a window of size 10.

Figure 8: The final offline-trained performance of each algorithm in discrete-action environments. The number in brackets is the standard error. Scores are normalized. Bold numbers mark the best performance in each setting. Performance was averaged over 10 random seeds.

Figure 9: The final offline-trained performance of each algorithm in discrete-action environments, reporting the return per episode before normalization. The number in brackets is the standard error. Bold numbers mark the best performance in each setting. Performance was averaged over 10 random seeds.

Figure 11: The final performance in continuous-action environments. The number in brackets is the standard error. Scores are normalized. Bold numbers mark the best performance in each setting. Performance was averaged over 10 random seeds, except that CQL had 5 seeds.

Figure 12: The absolute final performance in continuous-action environments, reporting the score before normalization. The number in brackets is the standard error. Bold numbers mark the best performance in each setting. Performance was averaged over 10 random seeds, except that CQL had 5 seeds.



