THE IN-SAMPLE SOFTMAX FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our methods suffers from insufficient action coverage: the standard max operator may select a maximal action that has not been seen in the dataset. Bootstrapping from these inaccurate values can lead to overestimation and even divergence. A growing number of methods attempt to approximate an in-sample max, one that uses only actions well-covered by the dataset. We highlight a simple fact: it is more straightforward to approximate an in-sample softmax using only actions in the dataset. We show that policy iteration based on the in-sample softmax converges, and that for decreasing temperatures it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC) using this in-sample softmax, and show that it is consistently better than or comparable to existing offline RL methods, and is also well-suited to fine-tuning. We release the code on GitHub.

1. INTRODUCTION

A common goal in reinforcement learning (RL) is to learn a control policy from data. In the offline setting, the agent has access to a batch of previously collected data. This data could have been gathered under a near-optimal behavior policy, from a mediocre policy, or from a mixture of different policies (perhaps produced by several human operators). A key challenge is to be robust to this data-gathering distribution, since in many application settings we do not have control over data collection. Most approaches in offline RL learn action-values, either through Q-learning updates, bootstrapping off of a maximal action in the next state, or through actor-critic algorithms, where the action-values are updated using temporal-difference (TD) learning to evaluate the actor. In either case, poor action coverage can interact poorly with bootstrapping, yielding bad performance. Action-value updates based on TD involve bootstrapping off an estimate of values in the next state. This bootstrapping is problematic if the value is an overestimate, which is likely to occur when there are actions that are never sampled in a state (Fujimoto et al., 2018; Kumar et al., 2019; Fujimoto et al., 2019). When using a maximum over actions, this overestimate will be selected, pushing up the value of the current state and action. Such updates can lead to poor policies and instability (Fujimoto et al., 2018; Kumar et al., 2019; Fujimoto et al., 2019). There are two main approaches in offline RL to handle this overestimation issue. One direction constrains the learned policy to be similar to the dataset policy (Wu et al., 2019; Peng et al., 2020; Nair et al., 2021; Brandfonbrener et al., 2021; Fujimoto & Gu, 2021). A related idea is to constrain the stationary distribution of the learned policy to be similar to the data distribution (Yang et al., 2022). The challenge with both of these approaches is that they rely on the dataset being generated by an expert or near-optimal policy.
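The overestimation problem described above can be made concrete with a minimal tabular sketch. All values below are hypothetical; the point is only that a standard Q-learning update bootstraps off the maximum over next-state actions, so a single spuriously large value for a never-sampled action is selected and propagated:

```python
import numpy as np

gamma, alpha = 0.99, 0.1
q = np.zeros((3, 2))   # hypothetical tabular action-values: 3 states, 2 actions
q[1] = [0.5, 8.0]      # action 1 in state 1 is a spurious overestimate
                       # (never sampled in the dataset)

# One Q-learning update on a transition (s, a, r, s'): the max over
# next-state actions selects the overestimate and pushes up q(s, a).
s, a, r, s_next = 0, 0, 0.0, 1
td_target = r + gamma * q[s_next].max()   # picks 8.0, not the covered 0.5
q[s, a] += alpha * (td_target - q[s, a])
```

Repeating such updates spreads the inflated value to other state-action pairs, which is the instability the citations above document.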
When used on datasets from more suboptimal policies, like those commonly found in industry, they do not perform well (Kostrikov et al., 2022). The other approach is to bootstrap off pessimistic value estimates (Kidambi et al., 2020; Kumar et al., 2020; Kostrikov et al., 2021; Yu et al., 2021; Jin et al., 2021; Xiao et al., 2021) and, relatedly, to identify and reduce the influence of out-of-distribution actions using ensembles (Kumar et al., 2019; Agarwal et al., 2020; Ghasemipour et al., 2021; Wu et al., 2021; Yang et al., 2021; Bai et al., 2022). One simple strategy that has been more recently proposed is to constrain the set of actions considered for bootstrapping to the support of the dataset D. In other words, if π_D(a|s) is the conditional action distribution underlying the dataset, then we use max_{a' : π_D(a'|s') > 0} q(s', a') instead of max_{a'} q(s', a'): a constrained or in-sample max. This idea was first introduced for Batch-Constrained Q-learning (BCQ) (Fujimoto et al., 2019) in the tabular setting, with a generative model used to approximate and sample π_D(a|s) (Fujimoto et al., 2019; Zhou et al., 2020; Wu et al., 2022). Implicit Q-learning (IQL) (Kostrikov et al., 2022) was the first model-free approximation to use this in-sample max, with a later modification making it less conservative (Ma et al., 2022). IQL instead uses expectile regression to push the action-values to predict upper expectiles that are a (close) lower bound to the true maximum. The approach nicely avoids estimating π_D, and empirically performs well. Using only actions in the dataset is beneficial because it can be difficult to properly constrain the support of a learned model for π_D and ensure it does not output out-of-distribution actions. There are, however, a few limitations to IQL. The IQL solution depends on the action distribution, not just its support.
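In the tabular setting, the in-sample max is just a max restricted to the support of π_D. A minimal sketch, with hypothetical action-values and a hypothetical empirical action distribution:

```python
import numpy as np

# Hypothetical learned action-values q(s', .) for a next state s'.
# Action 3 never appears in the dataset, so its value is an
# arbitrary (here, inflated) estimate.
q_next = np.array([1.0, 0.8, 0.9, 5.0])

# Empirical action distribution pi_D(.|s') estimated from the batch;
# action 3 has zero support.
pi_D = np.array([0.5, 0.3, 0.2, 0.0])

standard_max = q_next.max()             # bootstraps off the overestimate
in_sample_max = q_next[pi_D > 0].max()  # restricted to covered actions
```

Here the standard max returns the spurious 5.0, while the in-sample max returns 1.0, the best value among actions actually covered by the dataset.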
In practice, we would expect IQL to perform poorly when the data distribution is skewed towards suboptimal actions in some states, pulling down the expectile regression targets. We find evidence for this in our experiments. Additionally, convergence is difficult to analyze because expectile regression does not have a closed-form solution. One recent work showed that the Bellman operator underlying an expectile value learning algorithm is a contraction, but only in the setting with deterministic transitions (Ma et al., 2022). In this work, we revisit how to directly use the in-sample max. Our key insight is simple: in the entropy-regularized setting, sampling under support constraints is more straightforward for the softmax. We first define the in-sample softmax and show that it maintains the same contraction and convergence properties as the standard softmax. Further, we show that with a decreasing temperature (entropy) parameter, the in-sample softmax approaches the in-sample max. This formulation, therefore, is useful both for those wishing to incorporate entropy regularization and as a reasonable approximation to the in-sample max when a small temperature is selected. We then show that we can obtain a policy update that relies primarily on sampling from the dataset, which is naturally in-sample, rather than requiring samples from an estimate of π_D. We conclude by showing that our resulting In-Sample Actor-Critic algorithm consistently outperforms or matches existing methods, despite being notably simpler, in offline RL experiments with and without fine-tuning.
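The temperature behavior claimed above can be illustrated with a rough sketch. The precise in-sample softmax is defined later in the paper; as a simplified stand-in, a log-sum-exp restricted to the support of π_D already exhibits the key property that, as the temperature τ decreases, the value approaches the in-sample max and never involves out-of-sample actions (all numbers hypothetical):

```python
import numpy as np

def in_sample_logsumexp(q, pi_d, tau):
    """Temperature-scaled log-sum-exp over actions in the support of pi_d.
    A simplified illustration only: the paper's in-sample softmax also
    accounts for pi_d's probabilities, not just its support."""
    q_in = q[pi_d > 0]
    m = q_in.max()  # shift for numerical stability
    return m + tau * np.log(np.sum(np.exp((q_in - m) / tau)))

# Hypothetical values: action 3 is out-of-sample and overestimated.
q = np.array([1.0, 0.8, 0.9, 5.0])
pi_d = np.array([0.5, 0.3, 0.2, 0.0])

values = {tau: in_sample_logsumexp(q, pi_d, tau) for tau in (1.0, 0.1, 0.01)}
# As tau shrinks, the value tends to the in-sample max of 1.0; the
# out-of-sample overestimate 5.0 never enters the computation.
```

This is what makes the softmax convenient here: the operator is a smooth function of only in-support action-values, so it can be estimated from dataset samples alone.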

2. PROBLEM SETTING

In this section we outline the key issue of action coverage in offline RL that we address in this work.

2.1. MARKOV DECISION PROCESS

We consider a finite Markov Decision Process (MDP) determined by M = {S, A, P, r, γ} (Puterman, 2014), where S is a finite state space, A is a finite action space, γ ∈ [0, 1) is the discount factor, and r : S × A → R and P : S × A → ∆(S) are the reward and transition functions.¹ The value function specifies the future discounted total reward obtained by following a policy π : S → ∆(A), v_π(s) = E_π[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s], where we use E_π to denote the expectation under the distribution induced by the interconnection of π and the environment. The corresponding action-value function is q_π(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[v_π(s')]. There exists an optimal policy π* that maximizes the values for all states s ∈ S. We use v* and q* to denote the optimal value functions. The optimal values satisfy the Bellman optimality equations

v*(s) = max_a [ r(s, a) + γ E_{s'∼P(·|s,a)}[v*(s')] ],    q*(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'} q*(s', a') ].    (1)

In this work we more specifically consider the entropy-regularized MDP setting, also called the maximum entropy setting, where an entropy term is added to the reward to encourage the policy to

¹ We use the standard notation ∆(X) to denote the set of probability distributions over a finite set X.
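The Bellman optimality equations can be checked numerically by running value iteration on a small MDP until the values are a fixed point of the operator. A minimal sketch on a hypothetical two-state, two-action MDP:

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions, deterministic transitions.
n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))   # P[s, a, s'] = transition probability
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.0, 1.0]
P[1, 0] = [1.0, 0.0]; P[1, 1] = [0.0, 1.0]
r = np.array([[0.0, 1.0],       # r[s, a]
              [0.0, 2.0]])

# Value iteration: repeatedly apply the Bellman optimality operator.
v = np.zeros(n_s)
for _ in range(1000):
    q = r + gamma * (P @ v)     # q(s,a) = r(s,a) + gamma * E_{s'}[v(s')]
    v = q.max(axis=1)           # v(s) = max_a q(s,a)
```

At convergence, v is unchanged by a further application of the operator, which is exactly the fixed-point condition stated by the optimality equations.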

