OPTIMAL CONSERVATIVE OFFLINE RL WITH GENERAL FUNCTION APPROXIMATION VIA AUGMENTED LAGRANGIAN

Abstract

Offline reinforcement learning (RL), which aims at learning good policies from historical data, has received significant attention over the past years. Much effort has focused on improving offline RL practicality by addressing the prevalent issue of partial data coverage through various forms of conservative policy learning. While the majority of algorithms do not have finite-sample guarantees, several provable conservative offline RL algorithms are designed and analyzed within the single-policy concentrability framework that handles partial coverage. Yet, in the nonlinear function approximation setting where confidence intervals are difficult to obtain, existing provable algorithms suffer from computational intractability, prohibitively strong assumptions, and suboptimal statistical rates. In this paper, we leverage the marginalized importance sampling (MIS) formulation of RL and present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification. We identify that the key to successfully solving the sample-based approximation of the MIS problem is ensuring that certain occupancy validity constraints are nearly satisfied. We enforce these constraints by a novel application of the augmented Lagrangian method and prove the following result: with the MIS formulation, augmented Lagrangian is enough for statistically optimal offline RL. In stark contrast to prior algorithms that induce additional conservatism through methods such as behavior regularization, our approach provably eliminates this need and reinterprets regularizers as "enforcers of occupancy validity" rather than "promoters of conservatism."

1. INTRODUCTION

The goal of offline RL is to design agents that learn to achieve competence in a task using only a previously-collected dataset of interactions (Lange et al., 2012). Offline RL is a promising tool for many critical applications, from healthcare to autonomous driving to scientific discovery, where the online mode of learning by interacting with the environment is dangerous, impractical, costly, or even impossible (Levine et al., 2020). Despite this, offline RL has not yet been truly successful in practice (Fujimoto et al., 2019) and impressive RL performance has been limited to settings with known environments (Silver et al., 2017; Moravčík et al., 2017), access to accurate simulators (Mnih et al., 2015; Degrave et al., 2022; Fawzi et al., 2022), or expert demonstrations (Vinyals et al., 2017). One of the central challenges in offline RL is the lack of uniform coverage in real datasets and the distribution shift between the occupancy of candidate policies and the offline data distribution, which pose difficulties in accurately evaluating the candidate policies. Over the past years, a body of literature has focused on addressing this challenge by developing conservative algorithms, which aim at picking a policy among those well-covered in the data. On the practical front, various forms of conservatism are proposed such as behavior regularization through policy constraints (Kumar et al., 2019; Fujimoto et al., 2019; Nachum & Dai, 2020), learning conservative values (Kumar et al., 2020; Liu et al., 2020; Agarwal et al., 2020), or learning pessimistic models (Kidambi et al., 2020; Yu et al., 2020; 2021); see Appendix B for further discussion on related work.
From a theoretical standpoint, partial data coverage has recently been studied within variants of the single-policy concentrability framework (Rashidinejad et al., 2021; Xie et al., 2021; Uehara & Sun, 2021), which characterizes the distribution shift between offline data and occupancy of a target (often optimal) policy, in contrast to all-policy concentrability commonly used in earlier works (Scherrer, 2014; Chen & Jiang, 2019; Liao et al., 2020; Zhang et al., 2020a; Xie & Jiang, 2021). Within this framework and in the tabular and linear function approximation settings, pessimistic algorithms that leverage uncertainty quantifiers to construct lower confidence bounds (Jin et al., 2021; Rashidinejad et al., 2021; Yin et al., 2021; Shi et al., 2022; Li et al., 2022) enjoy the optimal statistical rate. In the general function approximation setting, pessimistic algorithms largely assume oracle access to uncertainty quantification, either for constructing penalties that are subtracted from rewards (Jin et al., 2021; Jiang & Huang, 2020) or for selecting the most pessimistic option among those that fall within the confidence region implied by the offline data (Uehara & Sun, 2021; Xie et al., 2021; Chen & Jiang, 2022). However, uncertainty quantifiers are difficult to obtain in nonlinear function approximation and existing heuristics are empirically observed to be unreliable (Rashid et al., 2019; Tennenholtz et al., 2021; Yu et al., 2021). Recent works by Cheng et al. (2022) and Zhan et al. (2022) propose provable alternatives to uncertainty-based methods, but leave achieving the optimal statistical rate of $1/\sqrt{N}$, where $N$ is the dataset size, as an open problem. Among all, the marginalized importance sampling (MIS) methods, which aim at learning weights $w$ that estimate the distribution shift between the induced policy occupancy $d_w$ and the data distribution $\mu$, lend themselves well to the single-policy concentrability framework.
Though more popular in off-policy evaluation (Liu et al., 2018; Xie et al., 2019; Uehara et al., 2020; Zhang et al., 2020b), MIS has also been used for conservative offline RL such as AlgaeDICE (Nachum et al., 2019b) and OptiDICE (Lee et al., 2021), both of which incorporate behavior regularization. Recently, Zhan et al. (2022) theoretically studied a variant of OptiDICE, showing that MIS with behavior regularization enjoys finite-sample guarantees (though achieving a suboptimal $1/N^{1/6}$ rate) and circumvents certain fundamental difficulties observed in value-based offline RL with function approximation (Du et al., 2019; Wang et al., 2020; 2021; Weisz et al., 2021; Zanette, 2021; Foster et al., 2021).

1.1. CONTRIBUTIONS AND RESULTS

Motivated by the benefits offered by MIS, we study designing statistically optimal offline learning algorithms under the MIS formulation with general function approximation and single-policy concentrability. We conduct theoretical investigations and design algorithms starting from multi-armed bandits (MABs), going forward to contextual bandits (CBs), and finally Markov decision processes (MDPs). In the rest of this section, we present a preview of our contributions and results.

Multi-armed bandits. Empirical MIS algorithms often incorporate behavior regularization, whose role is justified as promoting conservatism by keeping the occupancies of learned and behavior policies close (Nachum et al., 2019b; Lee et al., 2021). Yet, whether and why these regularizers are necessary from a theoretical perspective remains unclear. Zhan et al. (2022) motivate behavior regularization as a way of introducing curvature in an otherwise linear optimization problem. We extensively investigate the effect of regularization, starting from the simplest setting of MABs with function approximation, as existing algorithms, when specialized to offline MABs, are either intractable, have suboptimal finite-sample guarantees, or require access to uncertainty quantifiers. We state our results on offline MABs with general function approximation and single-policy concentrability in the informal theorem below.

Theorem (informal) (I) There exists an offline MAB instance where the unregularized MIS fails to achieve a suboptimality that decays with $N$. (II) MIS with behavior regularization (PRO-MAB, Algorithm 1) achieves $O(1/\sqrt{N})$ suboptimality. (III) If one searches only over the space of weights that induce valid occupancies ($d_w = 1$), then unregularized MIS achieves $O(1/\sqrt{N})$ suboptimality.

Here, we prove that unregularized MIS fails even in bandits and provide a tight analysis of PRO-MAB, a special case of the PRO-RL algorithm, improving over the original $1/N^{1/6}$ rate shown by Zhan et al. (2022). In our analysis, we find that the key to the success of PRO-MAB is near-validity of the learned occupancy $d_w$. In MABs, the validity constraint simply requires the learned occupancy to be a probability distribution: $d_w = \sum_a w(a)\mu(a) = 1$, where $a$ is an arm. With a proper choice of hyperparameter, we show that behavior regularization forces the learned occupancy to be nearly valid: $d_w = \Omega(1)$. We further prove that regularization is not required if validity is otherwise satisfied.

Given that occupancy validity is the constraint of the optimization problem solved by MIS (see e.g. (1)), we ask whether there are any methods for solving empirical optimization problems that find more constraint-adhering solutions compared to those yielded by Lagrange multipliers adopted in prior works (Lee et al., 2021; Zhan et al., 2022). The augmented Lagrangian method (ALM), which adds a quadratic loss on the constraints $(d_w - 1)^2$, is a natural choice for our purpose. The ALM term can be easily estimated from offline data and forms Algorithm 1. We show that ALM results in $d_w = \Omega(1)$, ensuring near-validity of the learned occupancy and leading to the following guarantee.

Theorem (informal) The policy returned by an algorithm that combines ALM with MIS (ALMIS) for offline MABs (Algorithm 1) achieves $O(1/\sqrt{N})$ suboptimality.
ALMIS offers several benefits over PRO-MAB such as eliminating the need for picking the regularizer and only requiring single-policy concentrability instead of the two-policy requirement of PRO-MAB, which can be strong (Section 5.3). Additionally, behavior regularization introduces bias in the solution even with infinite data (Chen & Jiang, 2022) and the bias-variance tradeoff must be carefully handled. However, ALM merely enforces the optimization constraints and leads to provably unbiased solutions (Lemma 14). More importantly, as we see shortly, going beyond the single-state MABs, behavior regularization becomes suboptimal while ALMIS maintains optimality.

Contextual bandits.

In offline CBs, we analyze two approaches: MIS with behavior regularization, and an extension of ALMIS. We state our results in the following informal theorem. Informally, the failure of PRO-CB to achieve the optimal rate is because the regularization parameter has to be small to control bias, but such small regularization is not strong enough to ensure the validity of learned occupancy in most states. Therefore, one must choose larger regularization, leading to an overall suboptimal rate. Prior works Chen & Jiang (2022); Cheng et al. (2022) also allude to this phenomenon, explaining that regularizers appear to be the culprit behind suboptimal rates.

Theorem (informal) (I)

In CBs, the occupancy validity constraints require the conditional occupancy to be a valid probability distribution in every state. In Algorithm 2, we incorporate ALM in offline CBs by adding a weighted sum of quadratic losses describing the validity constraint in each state, where the weights are set to the state occupancies to capture their relative importance. Enforcement of the constraints by ALM without introducing any bias is the key to the optimality of our algorithm.

MDPs. Validity constraints in MDPs ensure that the learned state occupancy $d_w(s) = \sum_a w(s,a)\mu(s,a)$ is close to the actual state occupancy $d^{\pi_w}(s)$, where $\pi_w$ is the policy computed from weights $w$. Directly enforcing this constraint results in an ALM term that cannot easily be estimated from offline data. We address this difficulty by expressing the ALM term in the variational form. From there, we derive two variants of the ALMIS algorithm for offline RL, one model-based and one model-free, that enjoy the following guarantee.

Theorem (informal) Both variants of ALMIS for offline RL achieve $O(1/\sqrt{N})$ suboptimality.

This marks ALMIS as the first practical and statistically optimal offline RL algorithm that operates in the general function approximation and partial data coverage setting, while avoiding uncertainty quantification and additional regularizers. Conservatism of ALMIS is baked into the MIS formulation and supported by the ALM: bounded MIS weights prevent the learned occupancy from deviating significantly from the data distribution, and ALM ensures closeness of the learned and actual occupancies. When combined, ALMIS learns a policy whose actual occupancy is close to the data distribution. We thus prove that ALM improves sample complexity compared to alternatives such as behavior regularization.
This is in addition to the benefits on optimization stability that are likely to be offered by the ALM, as the ALM improves over the ill-posed Lagrange multiplier objective (Ben-Tal & Nemirovski, 2022). Our theoretical findings can explain the empirical observations of Yang et al. (2020), who find MIS with behavior regularization to be unstable and propose regularizers in "the spirit of ALM" that gain superior performance and attribute the performance gain to improved optimization. In this work, we present a theoretically-founded way of introducing ALM in offline RL and our analysis shows that ALM also leads to optimal sample complexity.

2. BACKGROUND

Markov decision process. An infinite-horizon discounted MDP is described by a tuple $M = (S, A, P, R, \rho, \gamma)$, where $S$ is the state space, $A$ is the action space, $P : S \times A \to \Delta(S)$ is the transition kernel, $R : S \times A \to \Delta([0, 1])$ encodes a family of reward distributions with $r : S \times A \to [0, 1]$ as the expected reward function, $\rho \in \Delta(S)$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount factor. We assume $S$ and $A$ are finite; however, our results do not depend on their cardinalities and can be naturally extended to infinite sets. A stationary (stochastic) policy $\pi : S \to \Delta(A)$ specifies a distribution over actions in each state. Each policy $\pi$ induces an occupancy density over state-action pairs $d^\pi : S \times A \to [0, 1]$ defined as $d^\pi(s, a) := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P_t(s_t = s, a_t = a; \pi)$, where $P_t(s_t = s, a_t = a; \pi)$ denotes the $(s, a)$ visitation probability at step $t$, starting at $s_0 \sim \rho(\cdot)$ and following $\pi$. We abuse notation and also write $d^\pi(s) = \sum_{a \in A} d^\pi(s, a)$ to denote the discounted state occupancy. Additionally, the operator $P^\pi$ is applied to any function $u : S \times A \to \mathbb{R}$ and is defined as $(P^\pi u)(s, a) := \sum_{s', a'} P(s' \mid s, a)\,\pi(a' \mid s')\,u(s', a')$. An important quantity is the value of a policy $\pi$, which is the discounted sum of rewards $V^\pi(s) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s,\ a_t \sim \pi(\cdot \mid s_t),\ \forall t \ge 0\big]$ starting at $s \in S$. The Q-function $Q^\pi(s, a)$ of a policy is similarly defined. We write $J(\pi) := (1 - \gamma)\,\mathbb{E}_{s \sim \rho}[V^\pi(s)] = \mathbb{E}_{s,a \sim d^\pi}[r(s, a)]$ to represent a scalar summary of the performance of a policy $\pi$. We denote by $\pi^\star$ an optimal policy that maximizes the above objective and use the shorthand $V^\star := V^{\pi^\star}$ to denote the optimal value function.

Offline reinforcement learning. We focus on offline RL, where the agent is only provided with a previously-collected offline dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$. Here, $r_i \sim R(s_i, a_i)$, $s'_i \sim P(\cdot \mid s_i, a_i)$, and we assume $(s_i, a_i)$ pairs are generated i.i.d.
according to a data distribution $\mu \in \Delta(S \times A)$. To streamline the analysis, we assume that the conditional distribution $\mu(a \mid s)$ is known. The goal of offline RL is to learn a policy $\pi$ based on the offline dataset so as to minimize the sub-optimality with respect to an optimal policy $\pi^\star$, i.e. $J(\pi^\star) - J(\pi)$, with high probability.

Marginalized importance sampling. In this paper, we consider the marginalized importance sampling (MIS) formulation that aims at learning weights $w(s, a)$ to represent policy occupancy when multiplied by the data distribution: $d_w(s, a) = w(s, a)\mu(s, a)$. Also denote $d_w(s) = \sum_{a \in A} d_w(s, a)$. We define the policy induced by $w$ as $\pi_w(a \mid s) = d_w(s, a)/d_w(s)$ for $d_w(s) > 0$ and $\pi_w(a \mid s) = 1/|A|$ for $d_w(s) = 0$.

Offline data coverage assumption. We design and analyze our algorithms within the single-policy concentrability framework (Rashidinejad et al., 2021), stated below.

Definition 1 (Single-policy concentrability) Given a policy $\pi$, define $C^\pi$ to be the smallest constant that satisfies $\frac{d^\pi(s,a)}{\mu(s,a)} \le C^\pi$ for all $s \in S$ and $a \in A$.

$C^{\pi^\star} = C^\star$ captures coverage of $\pi^\star$ in the offline data and is much weaker than the widely used all-policy concentrability that assumes bounded $\max_\pi C^\pi$; see Appendix B for further discussion.

Notation. We write $\Delta(S)$ to denote the probability simplex over a set $S$. For a function class $F$, we write $|F|$ to denote its complexity (such as cardinality in the discrete case or covering number in the continuous case). We use the notation $x \lesssim y$ when there exists a constant $c > 0$ such that $x \le cy$, and $x \asymp y$ if constants $c_1, c_2 > 0$ exist such that $c_1|x| \le |y| \le c_2|x|$. We write $f(x) = O(g(x))$ if $M > 0$ and $x_0$ exist such that $|f(x)| \le M g(x)$ for all $x \ge x_0$.
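The two definitions above, the induced policy $\pi_w$ and the concentrability coefficient $C^\pi$, can be illustrated with a minimal tabular sketch. The function names `induced_policy` and `concentrability`, the nested-list layout, and the numbers are illustrative assumptions rather than anything specified in the paper; the sketch also assumes that pairs with zero occupancy impose no constraint on $C^\pi$.

```python
import math

def induced_policy(w, mu):
    """pi_w(a|s) = d_w(s,a) / d_w(s) with d_w(s,a) = w(s,a) * mu(s,a); uniform where d_w(s) = 0."""
    policy = []
    for w_s, mu_s in zip(w, mu):
        d_s = [w_sa * mu_sa for w_sa, mu_sa in zip(w_s, mu_s)]
        total = sum(d_s)
        if total > 0:
            policy.append([x / total for x in d_s])
        else:
            policy.append([1.0 / len(d_s)] * len(d_s))
    return policy

def concentrability(d_pi, mu):
    """Smallest C with d_pi(s,a) <= C * mu(s,a) for all (s,a); inf if d_pi has mass where mu has none."""
    c = 0.0
    for d_s, mu_s in zip(d_pi, mu):
        for d_sa, mu_sa in zip(d_s, mu_s):
            if d_sa == 0:
                continue  # assumption of this sketch: zero-occupancy pairs impose no constraint
            if mu_sa == 0:
                return math.inf
            c = max(c, d_sa / mu_sa)
    return c

# Hypothetical two-state, two-action example (rows are states, columns are actions).
mu = [[0.25, 0.5], [0.25, 0.0]]
w = [[2.0, 0.0], [0.0, 0.0]]
pi_w = induced_policy(w, mu)
c_covered = concentrability([[0.5, 0.0], [0.5, 0.0]], mu)
c_uncovered = concentrability([[0.5, 0.0], [0.0, 0.5]], mu)
```

In this toy instance the second occupancy places mass on a pair that the data never visits, so its coefficient is infinite, which is the situation single-policy concentrability rules out for the target policy.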

3. MULTI-ARMED BANDITS

We start by considering the offline learning problem in the multi-armed bandit (MAB) setting, which is a special case of an MDP with $\gamma = 0$, $|S| = 1$, and $D = \{(a_i, r_i)\}_{i=1}^{N}$, where $a_i \sim \mu(\cdot)$, $r_i \sim R(a_i)$. The goal of offline learning in MABs can be described as the following constrained optimization problem, where $d$ represents occupancy over actions (arms):

$$\max_{d \ge 0}\ \mathbb{E}_{a \sim d}[r(a)] \quad \text{s.t.} \quad \sum_a d(a) = 1. \tag{1}$$

3.1. PRIMAL-DUAL REGULARIZED OFFLINE BANDITS

To solve (1), the MIS approach with behavior regularization defines importance weights $w(a) = d(a)/\mu(a)$ and converts the problem (1) to its dual form by introducing the Lagrange multiplier $v$:

$$\max_{w \ge 0} \min_{v}\ L^{\mathrm{MAB}}_{\alpha}(w, v) := \mathbb{E}_{a \sim \mu}[w(a) r(a)] - v\,\big(\mathbb{E}_{a \sim \mu}[w(a)] - 1\big) - \alpha\,\mathbb{E}_{a \sim \mu}[f(w(a))]. \tag{2}$$

The last term in (2) is the behavior regularizer that characterizes the $f$-divergence between the learned occupancy $d$ and the data distribution $\mu$. This term was originally proposed to induce conservatism by keeping the learned policy close to the behavior policy (Nachum et al., 2019b; Lee et al., 2021). Problem (2) satisfies strong duality and we denote optimal primal and dual variables by $w^\star_\alpha$ and $v^\star_\alpha$ (Appendix C.2). When $\alpha = 0$, weights $w^\star := w^\star_0$ (which might not be unique) induce an optimal policy and $v^\star := v^\star_0$ is the optimal reward. Restricting $w$ and $v$ to function classes $\mathcal{W} \subseteq \mathbb{R}^{|A|}$ and $\mathcal{V} \subseteq \mathbb{R}$ and solving the empirical version of (2) yields primal-dual regularized offline MAB (PRO-MAB, Algorithm 5), a special case of the PRO-RL algorithm of Zhan et al. (2022).

One might wonder whether the unregularized algorithm ($\alpha = 0$) is sufficient for solving the offline learning problem in MABs, particularly under the natural and common assumption that elements of the function class $\mathcal{W}$ are bounded: $w(a) = d(a)/\mu(a) \le B_w$. In the following proposition, we show that the answer is negative and there exists a MAB instance in which the unregularized algorithm finds a policy that suffers from a constant suboptimality. The proof is provided in Appendix C.3.

Proposition 1 (Unregularized MIS fails in MABs) Assume $0 \le w(a) \le B_w$ for $w \in \mathcal{W}$ and $|v| \le B_v$ for $v \in \mathcal{V}$. Suppose realizability of any one of $w^\star \in \mathcal{W}$ and $v^\star \in \mathcal{V}$ and concentrability of $\pi^\star := \pi_{w^\star}$. For any $N \ge 2$, there exists a MAB instance where $\pi$ returned by Algorithm 5 with $\alpha = 0$ satisfies $J(\pi^\star) - J(\pi) = 1/6$ with a constant probability.

We note that Zhan et al.
(2022) also argue the failure of the unregularized algorithm by giving a counterexample in the MDP setting. We discuss this example in detail in Section 5.3. Proposition 1 reveals additional insights: the objective (16) with $\alpha = 0$ fails not just in MDPs but also in bandits, even when the optimal policy is unique and data are collected by running a behavior policy.

Given the failure of the unregularized algorithm, we conduct a tight analysis of PRO-MAB with $\alpha > 0$. In the next theorem, we prove that under similar assumptions as Zhan et al. (2022) and with a proper choice of $\alpha$, PRO-MAB returns a policy that enjoys optimal sample complexity.

Theorem 1 (Suboptimality of PRO-MAB) Let $f$ be $M_f$-strongly convex and non-negative with bounded value $|f(x)| \le B_f$ and derivative $|f'(x)| \le B_{f'}$. Assume $0 \le w(a) \le B_w$ for $w \in \mathcal{W}$ and $|v| \le B_v$ for $v \in \mathcal{V}$. Fix $\delta > 0$ and set $\alpha \asymp \frac{B_w(B_v + 1) + B_f}{M_f} \sqrt{\frac{\log(|\mathcal{V}||\mathcal{W}|/\delta)}{N}}$. Suppose realizability of $w^\star_\alpha \in \mathcal{W}$ and $v^\star_\alpha \in \mathcal{V}$ and concentrability of $\pi^\star := \pi_{w^\star}$ and $\pi^\star_\alpha := \pi_{w^\star_\alpha}$. Then, with probability at least $1 - \delta$, the policy $\pi$ returned by Algorithm 5 achieves

$$J(\pi^\star) - J(\pi) \lesssim \frac{(B_f + B_w(B_v + 1))(B_f + B_{f'} B_w)}{M_f} \sqrt{\frac{\log(|\mathcal{V}||\mathcal{W}|/\delta)}{N}}.$$

To our knowledge, this is the first statistically optimal guarantee for a practical offline MAB algorithm with function approximation and partial coverage, and it improves over the $1/N^{1/6}$ guarantee given by Zhan et al. (2022). We now briefly explain the differences between the analysis methods; a complete proof is deferred to Appendix C.4. Zhan et al. (2022) bound policy suboptimality by $\alpha + 1/(\alpha^{1/2} N^{1/4})$, where the first term stems from the bias caused by the regularizer and the second term emerges from bounding the difference of $\hat{w}$ and $w^\star_\alpha$ via strong convexity of $L_\alpha$. Optimizing the bound over $\alpha$ gives the final $1/N^{1/6}$ guarantee. In contrast, our analysis connects suboptimality to occupancy validity. We prove that suboptimality is bounded by $\alpha + 1/(d_{\hat{w}} \sqrt{N})$, where $d_{\hat{w}} = \sum_a \hat{w}(a)\mu(a)$.
We then show that setting $\alpha \asymp 1/\sqrt{N}$ is sufficient to ensure near-validity of occupancy, $d_{\hat{w}} = \Omega(1)$, yielding the optimal rate. We observe a similar phenomenon in Proposition 1: a small $d_w$ for certain $w \in \mathcal{W}$ can cause the unregularized MIS to fail. In the following section, we investigate this phenomenon further, leading to a new offline learning algorithm.
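To make the empirical version of (2) concrete, the following is a minimal sketch that searches finite candidate classes by brute force. It is not Algorithm 5 as stated in the appendix: the function names, the finite classes, the toy data, and the regularizer $f(x) = (x-1)^2/2$ are all illustrative assumptions. The helper `empirical_mass` computes the plug-in estimate of $d_w$ that the analysis above keeps bounded away from zero.

```python
def pro_mab(data, W, V, alpha, f):
    """Max-min search over finite classes for the empirical version of objective (2).

    data: list of (arm, reward) pairs; W: candidate weight vectors indexed by arm;
    V: candidate scalar multipliers; alpha: regularization strength; f: regularizer.
    """
    n = len(data)

    def objective(w, v):
        return sum(w[a] * r - v * (w[a] - 1.0) - alpha * f(w[a]) for a, r in data) / n

    # Outer maximization over w, inner minimization over v.
    return max(W, key=lambda w: min(objective(w, v) for v in V))

def empirical_mass(data, w):
    """Plug-in estimate of d_w = sum_a w(a) mu(a) from samples a_i ~ mu."""
    return sum(w[a] for a, _ in data) / len(data)

# Hypothetical data: arm 0 always pays 0, arm 1 always pays 1, each pulled five times.
data = [(0, 0.0)] * 5 + [(1, 1.0)] * 5
W = [[2.0, 0.0], [0.0, 2.0]]
V = [0.0, 1.0]
w_hat = pro_mab(data, W, V, alpha=0.1, f=lambda x: 0.5 * (x - 1.0) ** 2)
mass = empirical_mass(data, w_hat)
```

Here both candidates have empirical mass one, so the choice is driven by the reward term; the interesting regime in the text is the one where some candidates have small mass.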

3.2. AUGMENTED LAGRANGIAN REPLACES BEHAVIOR REGULARIZATION

The next proposition further affirms the importance of policy validity and shows that if the occupancy is valid, such as by searching only over the weights that induce valid occupancies, then the unregularized algorithm enjoys an optimal rate. Proof of this result can be found in Appendix C.5.

Algorithm 1 ALM with MIS (ALMIS) for offline MAB
1: Inputs: Dataset $D = \{(a_i, r_i)\}_{i=1}^{N}$, classes $\mathcal{W}$ and $\mathcal{V}$.
2: Find a solution $\hat{w}, \hat{v}$ to the following problem
$$\max_{w \in \mathcal{W}} \min_{v \in \mathcal{V}}\ \hat{L}^{\mathrm{MAB}}_{\mathrm{AL}}(w, v) := \frac{1}{N} \sum_{i=1}^{N} \big[ w(a_i) r_i - v\,(w(a_i) - 1) \big] - \Big( \frac{1}{N} \sum_{i=1}^{N} w(a_i) - 1 \Big)^2. \tag{3}$$
3: Return: $\hat{\pi} = \pi_{\hat{w}}$.

Proposition 2 (Constraint satisfaction is sufficient in MAB) Assume as in Theorem 1. Let $\pi$ be the output of Algorithm 5 with $\alpha = 0$ and assume that $\sum_a \mu(a)\hat{w}(a) = 1$. Then, for any $\delta > 0$, the following holds with probability of at least $1 - \delta$:

$$J(\pi^\star) - J(\pi) \lesssim \big(B_w(B_v + 1) + \alpha B_f\big) \sqrt{\frac{\log(|\mathcal{V}||\mathcal{W}|/\delta)}{N}}.$$

Motivated by the discussion above, we take a step back and ask: are there any other methods for solving constrained optimization problems that find more constraint-satisfying solutions when applied to the empirical approximation of the original problem? A promising candidate is the augmented Lagrangian method (ALM), which adds a quadratic loss on the constraints to the objective. Applied to (1), ALM forms the following objective, whose empirical version leads to Algorithm 1:

$$\max_{w \ge 0} \min_{v}\ L^{\mathrm{MAB}}_{\mathrm{AL}}(w, v) := \mathbb{E}_{a \sim \mu}[w(a) r(a)] - v\,\big(\mathbb{E}_{a \sim \mu}[w(a)] - 1\big) - \big(\mathbb{E}_{a \sim \mu}[w(a)] - 1\big)^2. \tag{4}$$

The following theorem establishes an upper bound on the suboptimality of the policy returned by Algorithm 1. This theorem is a special case of Theorem 3, whose proof is given in Appendix D.3.

Theorem 2 (Suboptimality of Algorithm 1) Assume that $0 \le w(a) \le B_w$ for any $w \in \mathcal{W}$ and $|v| \le B_v$ for any $v \in \mathcal{V}$. Further suppose realizability of any one of $w^\star \in \mathcal{W}$ and $v^\star \in \mathcal{V}$ and concentrability of $\pi^\star = \pi_{w^\star}$.
For any fixed $\delta > 0$, the policy $\pi$ returned by Algorithm 1 achieves the following bound with probability of at least $1 - \delta$:

$$J(\pi^\star) - J(\pi) \lesssim B_w^2 (B_v + 1) \sqrt{\frac{\log(|\mathcal{W}||\mathcal{V}|/\delta)}{N}}.$$

In the proof, we show that ALM results in near-validity of $\hat{w}$ by ensuring that $d_{\hat{w}} = \Omega(1)$, leading to the optimal rate. Note that Algorithm 1 does not include any explicit form of conservatism through regularizers or uncertainty quantifiers. Colloquially, the MIS formulation and boundedness of the elements of $\mathcal{W}$ ensure that $d_{\hat{w}}(a)/\mu(a) = \hat{w}(a) \le B_w$, and ALM ensures that $d_{\hat{w}}$ is close to the actual occupancy. Thus, Algorithm 1 seeks a policy whose actual occupancy lies within the data distribution. Algorithm 1 offers several benefits compared to PRO-MAB: it only requires $\pi^\star$-concentrability instead of the $(\pi^\star, \pi^\star_\alpha)$-concentrability requirement of PRO-MAB, removes the need to design the regularization function $f$ and adjust $\alpha$, and does not introduce bias in the objective. The main advantage of ALM, however, becomes more evident as we move beyond bandits, where behavior regularization provably fails to achieve the optimal statistical rate while ALM maintains optimality.
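The empirical objective (3) is simple enough to evaluate directly. The sketch below solves the max-min problem of Algorithm 1 by enumeration over small finite classes and then forms $\pi_{\hat{w}}$; the function names, the finite candidate sets, the toy data, and the use of a known $\mu$ are illustrative assumptions, and a practical implementation would use parametric classes and gradient-based optimization instead.

```python
def almis_mab(data, W, V):
    """Brute-force max-min over finite classes for the empirical ALM objective (3)."""
    n = len(data)

    def objective(w, v):
        mean_wr = sum(w[a] * r for a, r in data) / n   # (1/N) sum_i w(a_i) r_i
        mean_w = sum(w[a] for a, _ in data) / n        # (1/N) sum_i w(a_i), estimate of d_w
        return mean_wr - v * (mean_w - 1.0) - (mean_w - 1.0) ** 2

    return max(W, key=lambda w: min(objective(w, v) for v in V))

def policy_from_weights(w, mu):
    """pi_w(a) proportional to w(a) * mu(a); uniform if the total mass is zero."""
    d = [w_a * mu_a for w_a, mu_a in zip(w, mu)]
    total = sum(d)
    return [x / total for x in d] if total > 0 else [1.0 / len(d)] * len(d)

# Hypothetical data: arm 0 always pays 0, arm 1 always pays 1, each pulled five times.
data = [(0, 0.0)] * 5 + [(1, 1.0)] * 5
mu = [0.5, 0.5]
W = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.1]]   # the last candidate has empirical mass 0.05
V = [0.0, 1.0]
w_hat = almis_mab(data, W, V)
pi_hat = policy_from_weights(w_hat, mu)
```

In this toy instance the third candidate is heavily penalized by the quadratic term because its estimated occupancy mass is far from one, while the two valid candidates are ranked by reward.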

4. CONTEXTUAL BANDITS

The offline contextual bandit (CB) problem is a special case of offline RL with $\gamma = 0$ and offline dataset $D = \{(s_i, a_i, r_i)\}_{i=1}^{N}$, where $s_i \sim \mu(\cdot) = \rho(\cdot)$, $a_i \sim \mu(\cdot \mid s_i)$, [...] $|f'(x)| \le B_{f'}$. Assume $0 \le w(s, a) \le B_w$ for $w \in \mathcal{W}$, $|v(s)| \le B_v$, realizability $w^\star, w^\star_\alpha \in \mathcal{W}$ and $v^\star, v^\star_\alpha \in \mathcal{V}$, and concentrability of $\pi^\star, \pi^\star_\alpha$. Let $\pi$ be the output of Algorithm 6. Then, for $N \ge \mathrm{poly}(\delta, B_w, B_v, B_f, B_{f'})$ and any $\alpha \ge 0$, there exists a CB instance such that $J(\pi^\star) - J(\pi) \gtrsim N^{\beta}$ with a constant probability, where $\beta > -1/2$.

Algorithm 2 ALM with MIS (ALMIS) for offline CB
1: Inputs: Dataset $D = \{(s_i, a_i, r_i)\}_{i=1}^{N}$, function classes $\mathcal{W}, \mathcal{V}$.
2: Find a solution $\hat{w}, \hat{v}$ to the following problem
$$\max_{w \in \mathcal{W}} \min_{v \in \mathcal{V}}\ \hat{L}^{\mathrm{CB}}_{\mathrm{AL}}(w, v) := \frac{1}{N} \sum_{i=1}^{N} \Big[ w(s_i, a_i)(r_i - v(s_i)) + v(s_i) - \Big( \sum_{a \in A} w(s_i, a)\mu(a \mid s_i) - 1 \Big)^2 \Big]. \tag{8}$$
3: Return: $\hat{\pi} = \pi_{\hat{w}}$.

Proposition 3 shows that behavior regularization is statistically suboptimal regardless of $\alpha$. The proof is presented in Appendix D.2. The main takeaway is that ensuring occupancy validity $\sum_a \hat{w}(s, a)\mu(a \mid s) = \Omega(1)$ for nearly all states appears to be critical in achieving the optimal rate. Yet, without introducing a large bias, behavior regularization is insufficient for such a guarantee.

4.2. OFFLINE CONTEXTUAL BANDITS WITH AUGMENTED LAGRANGIAN

To encourage occupancy validity, we extend ALM to CBs and propose the following objective:

$$\max_{w \ge 0} \min_{v}\ L^{\mathrm{CB}}_{\mathrm{AL}}(w, v) := \mathbb{E}_{\mu}[w(s, a) r(s, a)] - \mathbb{E}_{\mu}[v(s)(w(s, a) - 1)] - \mathbb{E}_{s \sim \mu}\big[(\mathbb{E}_{a \sim \mu(\cdot \mid s)}[w(s, a)] - 1)^2\big]. \tag{7}$$

When $|S| = 1$, (7) simplifies to the ALM objective (4) for MABs. The ALM term can be understood as follows. Each element encourages the validity of occupancy in each state, $\sum_a w(s, a)\mu(a \mid s) \approx 1$, and the elements are weighted according to the true state distribution: validity is more important in states that are actually more likely to be visited. Denote by $w^\star$ an optimal solution to (7), which is equal to the optimal solution of (6), and define $v^\star(s) := V^\star(s)$. The following theorem, whose proof is in Appendix D.3, states that ALMIS achieves the optimal rate in offline CBs.

Theorem 3 (Suboptimality of Algorithm 2) Assume $0 \le w(s, a) \le B_w$ for $w \in \mathcal{W}$ and $|v(s)| \le B_v$ for $v \in \mathcal{V}$. Moreover, suppose realizability of any one of $w^\star \in \mathcal{W}$ and $v^\star \in \mathcal{V}$ and concentrability of $\pi^\star = \pi_{w^\star}$. For any fixed $\delta > 0$, the policy $\pi$ returned by Algorithm 2 achieves the following suboptimality bound with probability of at least $1 - \delta$:

$$J(\pi^\star) - J(\pi) \lesssim B_w^2 (B_v + 1) \sqrt{\frac{\log(|\mathcal{W}||\mathcal{V}|/\delta)}{N}}.$$
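As an illustration of how the per-state ALM term enters the empirical CB objective, here is a minimal sketch that evaluates the objective of Algorithm 2 and solves the max-min problem by enumeration. The finite classes, the toy data, and the function name `almis_cb` are illustrative assumptions; the conditional distribution $\mu(a \mid s)$ is taken as known, as assumed in Section 2.

```python
def almis_cb(data, W, V, mu_cond):
    """Brute-force max-min for the empirical ALM objective of Algorithm 2.

    data: list of (state, action, reward); W: candidate tables w[s][a];
    V: candidate tables v[s]; mu_cond[s][a] = mu(a|s), assumed known.
    """
    n = len(data)

    def objective(w, v):
        total = 0.0
        for s, a, r in data:
            # Per-state validity gap: sum_a w(s, a) mu(a|s) - 1.
            gap = sum(w_sb * mu_sb for w_sb, mu_sb in zip(w[s], mu_cond[s])) - 1.0
            total += w[s][a] * (r - v[s]) + v[s] - gap ** 2
        return total / n

    return max(W, key=lambda w: min(objective(w, v) for v in V))

# Hypothetical instance: two contexts, two actions, reward 1 when the action equals the context.
mu_cond = [[0.5, 0.5], [0.5, 0.5]]
data = [(s, a, 1.0 if a == s else 0.0) for s in range(2) for a in range(2)] * 3
w_match = [[2.0, 0.0], [0.0, 2.0]]      # valid conditional occupancy on the rewarding action
w_mismatch = [[0.0, 2.0], [2.0, 0.0]]   # valid conditional occupancy on the wrong action
w_invalid = [[4.0, 0.0], [0.0, 4.0]]    # total conditional mass 2 in every state
V = [[1.0, 1.0], [0.0, 0.0]]
w_hat = almis_cb(data, [w_mismatch, w_invalid, w_match], V, mu_cond)
```

In this toy instance, dropping the quadratic term would leave `w_invalid` and `w_match` tied at a max-min value of 1; with the term, `w_invalid` drops to 0 and the valid candidate is selected.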

5. MARKOV DECISION PROCESSES

We now turn to offline RL. In addition to the offline dataset, we assume access to a dataset $D_0 = \{s_i\}_{i=1}^{N}$ with i.i.d. samples from the initial distribution $\rho$, similar to prior works (Lee et al., 2021; Zhan et al., 2022). The linear programming formulation of RL (Puterman, 2014) solves

$$\max_{d \ge 0}\ \mathbb{E}_{s,a \sim d}[r(s, a)] \quad \text{s.t.} \quad d(s) = (1 - \gamma)\rho(s) + \gamma \sum_{s', a'} P(s \mid s', a')\, d(s', a') \quad \forall s \in S. \tag{9}$$

The constraints are known as Bellman flow equations and restrict the search to the space of valid occupancy distributions $d^\pi$ that can be induced in the MDP by running a policy $\pi$.
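The Bellman flow constraints can be checked numerically on a small tabular MDP. The sketch below computes $d^\pi$ from its defining series (truncated after a fixed number of steps) and measures the largest violation of the flow equations for a candidate occupancy. The function names, the truncation length, and the two-state MDP are illustrative assumptions.

```python
def occupancy(P, pi, rho, gamma, steps=500):
    """Truncated series d(s,a) = (1-gamma) * sum_t gamma^t Pr_t(s_t=s, a_t=a) for a tabular MDP."""
    n_s, n_a = len(P), len(P[0])
    state_dist = list(rho)
    d = [[0.0] * n_a for _ in range(n_s)]
    weight = 1.0 - gamma
    for _ in range(steps):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = state_dist[s] * pi[s][a]
                d[s][a] += weight * p_sa
                for s2 in range(n_s):
                    nxt[s2] += p_sa * P[s][a][s2]
        state_dist, weight = nxt, weight * gamma
    return d

def flow_residual(d, P, rho, gamma):
    """max_s | d(s) - (1-gamma) rho(s) - gamma * sum_{s',a'} P(s|s',a') d(s',a') |."""
    n_s, n_a = len(P), len(P[0])
    worst = 0.0
    for s in range(n_s):
        inflow = sum(P[s2][a2][s] * d[s2][a2] for s2 in range(n_s) for a2 in range(n_a))
        worst = max(worst, abs(sum(d[s]) - (1.0 - gamma) * rho[s] - gamma * inflow))
    return worst

# Hypothetical MDP: from state 0, action 0 stays and action 1 moves to the absorbing state 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
rho, gamma = [1.0, 0.0], 0.5
pi = [[0.0, 1.0], [1.0, 0.0]]
d_valid = occupancy(P, pi, rho, gamma)
d_invalid = [[0.5, 0.0], [0.5, 0.0]]   # sums to one but is not induced by any policy here
```

The second candidate is a probability distribution over state-action pairs yet violates the flow equations, which is the kind of mismatch between a learned occupancy and an actual one that the validity constraints are meant to exclude.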

5.1. CONSERVATIVE OFFLINE RL WITH AUGMENTED LAGRANGIAN

Motivated by the success of ALM in bandits, we propose the following extension to offline RL:

$$\max_{w \ge 0} \min_{v}\ L^{\mathrm{MDP}}_{\mathrm{AL}}(w, v) := (1 - \gamma)\,\mathbb{E}_{\rho}[v(s)] + \mathbb{E}_{\mu}[w(s, a)\, e_v(s, a)] - \mathbb{E}_{s \sim d^{\pi_w}}\Big[\Big(\frac{d_w(s)}{d^{\pi_w}(s)} - 1\Big)^2\Big], \tag{10}$$

where $e_v(s, a) := r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s') - v(s)$. One can check that the first two terms are the Lagrange dual of (9) and the last term is a generalization of the ALM terms in bandits. The ALM elements encourage the occupancy $d_w(s)$ to be close in ratio to the actual occupancy $d^{\pi_w}(s)$ in each state and, as before, the ALM elements are weighted according to their actual visitation $d^{\pi_w}(s)$. Our particular ALM construction can be intuitively understood as follows. The MIS formulation learns bounded weights $\hat{w}(s, a) = d_{\hat{w}}(s, a)/\mu(s, a) \le B_w$. The ALM term ensures that the ratio $d_{\hat{w}}(s)/d^{\pi_{\hat{w}}}(s) = \Omega(1)$, which roughly translates to $d^{\pi_{\hat{w}}}(s, a)/\mu(s, a) \lesssim B_w$. The ALM term in (10) is difficult to estimate as it involves the expectation over the unknown occupancy $d^{\pi_w}$ and the computation of the ratio $d_w(s)/d^{\pi_w}(s)$. We resolve this difficulty in the next sections.

Published as a conference paper at ICLR 2023

Algorithm 3 ALM with MIS (ALMIS) for offline RL - Model-based
1: Inputs: Datasets $D, D_0, D_m$, function classes $\mathcal{W}, \mathcal{V}, \mathcal{U}, \mathcal{P}$, $f_*^{-1}(x) = 2\sqrt{x + 1} - 2$.
2: Estimate transitions via maximum likelihood: $\hat{P} = \arg\max_{P \in \mathcal{P}} \sum_{i=1}^{N_m} \ln P(s'_i \mid s_i, a_i)$.
3: Find a solution $\hat{w}, \hat{v}, \hat{u}$ to the following problem
$$\max_{w \in \mathcal{W}} \min_{v \in \mathcal{V}} \min_{u \in \mathcal{U}}\ \hat{L}^{\text{model-based}}_{\mathrm{AL}}(w, v, u) := \frac{1 - \gamma}{N_0} \sum_{i=1}^{N_0} \Big[ v(s_i) + \sum_{a} u(s_i, a)\pi_w(a \mid s_i) \Big] + \frac{1}{N} \sum_{i=1}^{N} w(s_i, a_i) \Big[ r_i + \gamma v(s'_i) - v(s_i) - f_*^{-1}\big( u(s_i, a_i) - \gamma (\hat{P}^{\pi_w} u)(s_i, a_i) \big) \Big]$$
4: Return: $\hat{\pi} = \pi_{\hat{w}}$.

5.2. ESTIMATING THE ALM TERM AND ALMIS ALGORITHMS FOR OFFLINE RL

We view the ALM term as the negative $f$-divergence between $d_w$ and $d^{\pi_w}$ with $f(x) := (x - 1)^2$ and express it in the variational form (Nguyen et al., 2010):

$$-\mathbb{E}_{d^{\pi_w}}\Big[\Big(\frac{d_w(s)}{d^{\pi_w}(s)} - 1\Big)^2\Big] = -D_f(d_w \,\|\, d^{\pi_w}) = \min_{x}\ \mathbb{E}_{d^{\pi_w}}[f_*(x(s, a))] - \mathbb{E}_{d_w}[x(s, a)]. \tag{11}$$

Here, $f_*$ is the convex conjugate of $f$ and we used the fact that $d_w(s, a)/d^{\pi_w}(s, a) = d_w(s)/d^{\pi_w}(s)$. Notice that $\mathbb{E}_{d^{\pi_w}}[f_*(x(s, a))]$ is the value of $\pi_w$ in the same MDP but with rewards $f_*(x(s, a))$. Define $u$ as the fixed point of the following Bellman equation:

$$u(s, a) := f_*(x(s, a)) + \gamma (P^{\pi_w} u)(s, a). \tag{12}$$

Since $u(s, a)$ is the Q-function of $\pi_w$ with rewards $f_*(x(s, a))$, we can rewrite (11) as

$$(11) = \min_{u}\ (1 - \gamma)\,\mathbb{E}_{s \sim \rho, a \sim \pi_w}[u(s, a)] - \mathbb{E}_{\mu}\big[w(s, a)\, f_*^{-1}\big(u(s, a) - \gamma (P^{\pi_w} u)(s, a)\big)\big]. \tag{13}$$

Equation (13) involves expectations over $\rho$ and $\mu$, which can be estimated empirically. Below, we discuss model-free and model-based methods for estimating the term involving the transition operator $P^{\pi_w}$. We include some details on practical implementations in Appendix E.1.

Model-based ALMIS. For the model-based route, we assume access to a class $\mathcal{P}$ that contains the true transitions and an additional dataset $D_m = \{(s_i, a_i, s'_i)\}_{i=1}^{N_m}$, where $(s_i, a_i) \sim \mu$ and $s'_i \sim P(\cdot \mid s_i, a_i)$. Given $D_m$, we obtain a maximum likelihood estimate of transitions and approximate the expectations using $D_0$ and $D$, which leads to model-based ALMIS for offline RL (Algorithm 3).

Model-free ALMIS. As an alternative, we consider developing a model-free variant that uses a single-sample estimate of $f_*^{-1}(u(s, a) - \gamma (P^{\pi_w} u)(s, a))$. This, however, roughly leads to the infamous double sampling problem (Baird, 1995). To circumvent this difficulty, in Appendix E.2 we use the dual embedding trick of Nachum et al. (2019a) to derive model-free ALMIS (Algorithm 4). Theorem 4 shows that ALMIS for offline RL enjoys optimal rates; see Appendix E.3 for the proof.
Theorem 4 (Suboptimality of ALMIS for offline RL) Assume $0 \le w(s, a) \le B_w$ for $w \in \mathcal{W}$, $|v(s)| \le B_v$ for $v \in \mathcal{V}$, and $|u(s, a)| \le B_u$. Suppose realizability of any one of $w^\star \in \mathcal{W}$ and $v^\star(s) = V^\star(s) \in \mathcal{V}$ and concentrability of $\pi^\star = \pi_{w^\star}$. Let $x_w(s, a) = \mathrm{clip}(x^\star_w(s, a), -B_x, B_x)$, where $x^\star_w$ is a solution to (11) and $B_x = (1 - \gamma)/4$, and define $u^\star_w$ as the fixed-point solution to (12) when $x = x_w$. Assume $u^\star_w \in \mathcal{U}$ for any $w \in \mathcal{W}$. Then, $B_u$ satisfies $(1 - \gamma)^{-1}(B_x^2/4 + B_x) \le B_u \le \frac{1}{2}$. Moreover, for any fixed $\delta > 0$, the following statements hold:

(I) Assume $N = N_0 = N_m$ for simplicity. If $P^\star \in \mathcal{P}$, then $\pi$ returned by Algorithm 3 achieves
$$J(\pi^\star) - J(\pi) \lesssim \frac{B_v + B_u + (1 + B_v) B_w}{(1 - \gamma)^3}\, B_u \sqrt{\frac{\log(|\mathcal{P}||\mathcal{U}||\mathcal{W}||\mathcal{V}|/\delta)}{N}}.$$

(II) Assume $N = N_0$ for simplicity. Let $\zeta^\star_{w,u} = \arg\max_{\zeta < 0} L^{\text{model-free}}_{\mathrm{AL}}(w, v, u, \zeta)$ defined in (54). Assume $\zeta^\star_{w^\star, u} \in \mathcal{Z}$ for $u \in \mathcal{U}$ and $B_{\zeta,L} \le |\zeta(s, a)| \le B_{\zeta,U}$ for $\zeta \in \mathcal{Z}$, where $B_{\zeta,L} \in (0, 2/(2 + B_x))$ and $B_{\zeta,U} \ge 2/(2 - B_x)$. Let $B_\zeta = \max\{B_{\zeta,U}, B_{\zeta,L}^{-1}\}$. Then, $\pi$ returned by Algorithm 4 achieves
$$J(\pi^\star) - J(\pi) \lesssim \frac{B_v + B_u + (1 + B_v + B_\zeta (B_u + 1)) B_w}{(1 - \gamma)^3} \sqrt{\frac{\log(|\mathcal{U}||\mathcal{W}||\mathcal{V}||\mathcal{Z}|/\delta)}{N}}.$$

Algorithm 4 ALM with MIS (ALMIS) for offline RL - Model-free
1: Inputs: Datasets $D, D_0$, function classes $\mathcal{W}, \mathcal{V}, \mathcal{U}, \mathcal{Z}$, $g_*(x) = -x - 2 - \frac{1}{x}$.
2: Find a solution $\hat{w}, \hat{v}, \hat{u}, \hat{\zeta}$ to $\max_{w \in \mathcal{W}} \min_{v \in \mathcal{V}} \min_{u \in \mathcal{U}} \max_{\zeta \in \mathcal{Z}} \hat{L}^{\text{model-free}}_{\mathrm{AL}}(w, v, u, \zeta)$ defined as
$$\frac{1 - \gamma}{N_0} \sum_{i=1}^{N_0} \Big[ v(s_i) + \sum_{a} u(s_i, a)\pi_w(a \mid s_i) \Big] + \frac{1}{N} \sum_{i=1}^{N} w(s_i, a_i) \Big[ r_i + \gamma v(s'_i) - v(s_i) + \zeta(s_i, a_i) \Big( u(s_i, a_i) - \gamma \sum_{a' \in A} u(s'_i, a')\pi_w(a' \mid s'_i) \Big) - g_*(\zeta(s_i, a_i)) \Big]$$
3: Return: $\hat{\pi} = \pi_{\hat{w}}$.

In Theorem 4, we make realizability assumptions on $u^\star_w$ for $w \in \mathcal{W}$ and $\zeta^\star_{w^\star, u}$ for $u \in \mathcal{U}$. Such assumptions are common in the theory of RL with function approximation (Munos & Szepesvári, 2008; Xie et al., 2021; Jiang & Huang, 2020) and removing them can be difficult or impossible (Foster et al., 2021). Recently, Zhan et al.
(2022) and Chen & Jiang (2022) propose algorithms that require only realizability of the optimal solutions; however, these algorithms are either computationally intractable or statistically suboptimal.

5.3. EXAMPLE: BEHAVIOR REGULARIZATION VS. AUGMENTED LAGRANGIAN

[Figure 1: An MDP with states $A$, $B$, $C$, actions $L$ and $R$, and rewards in $\{+0, +1\}$. The offline data distribution is $\mu(A, L) = 1/4$, $\mu(A, R) = 1/2$, $\mu(B) = 1/4$, $\mu(C) = 0$, which satisfies $\pi_{w_1}$-concentrability.]

We examine the MDP example in Figure 1 presented by Zhan et al. (2022). Assume $V = \{v^\star\}$ and $W = \{w_1, w_2\}$, where $w_1$ always selects $L$ from $A$ and $w_2$ always selects $R$ from $A$. One can check that $w_1(A, L) = 2$, $w_1(A, R) = 0$ and $w_2(A, L) = 0$, $w_2(A, R) = 1$.

Unregularized algorithm. As Zhan et al. (2022) state, the unregularized algorithm fails to distinguish between $w_1$ and $w_2$ even with infinite data, as the objectives at $w_1$ and $w_2$ are exactly equal.

Behavior regularization. Consider an instantiation of PRO-RL with regularizer $-\alpha \mathbb{E}_\mu[w^2(s,a)]$. Since in this example $\mathbb{E}_\mu[w_1^2(s,a)] > \mathbb{E}_\mu[w_2^2(s,a)]$, PRO-RL picks the wrong weight $w_2$, suffering constant suboptimality. Note, however, that the PRO-RL guarantees assume $\pi^\star_\alpha$-concentrability. Intuitively, behavior regularization causes $\pi^\star_\alpha$ to be more stochastic and thus requires $\mu(s,a) > 0$ for more states and actions. Here, since $\mu$ covers both $(A, L)$ and $(A, R)$, behavior regularization causes $\pi^\star_\alpha(R \mid A) > 0$ and thus $d^{\pi^\star_\alpha}(C) > 0$. To handle the MDP in Figure 1, PRO-RL additionally requires $\mu(C) > 0$ to satisfy $\pi^\star_\alpha$-concentrability.

ALM. In this example, ALM successfully picks the optimal $w_1$, as it avoids a mismatch between the actual and learned occupancies. This is because in (10) the ALM term is zero at $w_1$ due to realizability, whereas at $w_2$ the contribution of state $C$ alone gives the lower bound $\mathbb{E}_{s \sim d^{\pi_2}}\big[(d_{w_2}(s)/d^{\pi_2}(s) - 1)^2\big] \ge d^{\pi_2}(C)\,\big(d_{w_2}(C)/d^{\pi_2}(C) - 1\big)^2 = d^{\pi_2}(C) > 0$, where the equality holds because $\mu(C) = 0$ forces $d_{w_2}(C) = 0$.

6. DISCUSSION

We present a set of practical and statistically optimal algorithms for offline MAB, CB, and RL, under general function approximation and single-policy concentrability. Our algorithms are designed within the MIS formulation combined with a novel application of the augmented Lagrangian method. Importantly, our optimality guarantees hold under MIS combined with ALM alone, without any additional form of conservatism such as regularization or uncertainty quantification. Furthermore, we investigate the role of regularizers in MIS algorithms. Although the empirical benefits of such regularizers are often attributed to conservatism, our analysis suggests that conservatism stems from the MIS formulation, while the role of regularizers is to ensure the validity of the learned occupancy. Interesting future directions include conducting empirical evaluations of ALM, examining the possibility of removing strong realizability assumptions, and investigating practical and optimal offline RL algorithms whose guarantees hold under milder variants of single-policy concentrability more suited to function approximation.

A RESULTS SUMMARY

| Setting | Algorithm | Regularizer / ALM term | Concentrability | Suboptimality |
| MAB | MIS + behavior reg. (Algorithm 5) | $-\alpha \mathbb{E}_\mu[f(w(a))]$ ($\alpha \asymp 1/\sqrt{N}$) | two-policy: $\pi^\star$, $\pi^\star_\alpha$ | $O(1/\sqrt{N})$ (Theorem 1) |
| MAB | MIS + ALM (Algorithm 1) | $(\mathbb{E}_\mu[w(a)] - 1)^2$ | single-policy: $\pi^\star$ | $O(1/\sqrt{N})$ (Theorem 2) |
| CB | MIS + behavior reg. (Algorithm 6) | $-\alpha \mathbb{E}_\mu[f(w(s,a))]$ ($\alpha \ge 0$) | two-policy: $\pi^\star$, $\pi^\star_\alpha$ | $\Omega(N^{\beta})$, $\beta \ge -1/2$ (Proposition 3) |
| CB | MIS + ALM (Algorithm 2) | $\mathbb{E}_\mu\big[(\sum_a w(s,a)\mu(a \mid s) - 1)^2\big]$ | single-policy: $\pi^\star$ | $O(1/\sqrt{N})$ (Theorem 3) |
| MDP | MIS + ALM (Algorithms 3, 4) | $\mathbb{E}_{d^{\pi_w}}\big[(d_w(s)/d^{\pi_w}(s) - 1)^2\big]$ | single-policy: $\pi^\star$ | $O(1/\sqrt{N})$ (Theorem 4) |

B RELATED WORK

We covered a number of related works in the introduction and throughout the paper. In this section, we review more related literature.

B.1 CONCENTRABILITY ASSUMPTIONS

The lack of sufficient coverage in the offline dataset is one of the main challenges in offline RL. In RL theory, dataset coverage has often been characterized by concentrability definitions (Munos, 2007; Scherrer, 2014) . Earlier works on offline RL impose all-policy concentrability on the density ratio for all states and actions (Scherrer, 2014; Liu et al., 2019a; Chen & Jiang, 2019; Jiang, 2019; Wang et al., 2019; Liao et al., 2020; Zhang et al., 2020a) , with some requiring this ratio to be bounded for every time step (Szepesvári & Munos, 2005; Munos, 2007; Antos et al., 2008; Farahmand et al., 2010; Antos et al., 2007) 2022), which requires single-policy concentrability on density ratio for all states and actions.

B.2 CONSERVATIVE OFFLINE RL

A series of recent works on offline RL have focused on addressing partial coverage of offline dataset through conservative algorithm design. Broadly speaking, these methods can be broken down into several categories. The first category of methods applies policy constraints, enforcing the learned policy to be close to the behavior policy. Such constraints are applied either explicitly (Fujimoto et al., 2019; Ghasemipour et al., 2020; Jaques et al., 2019; Siegel et al., 2020; Kumar et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021) , implicitly (Peng et al., 2019; Nair et al., 2020) , or through importance sampling (Liu et al., 2019b; Swaminathan & Joachims, 2015; Nachum et al., 2019b; Lee et al., 2021; Zhang et al., 2020c; b) . Another category involves learning conservative values such as conservative Q-learning (Kumar et al., 2020) , fitted Q-iteration with conservative update (Liu et al., 2020) , subtracting penalties (Rezaeifar et al., 2022) , and critic regularization (Kostrikov et al., 2021) . The last category includes model-based methods such as learning pessimistic models (Kidambi et al., 2020; Guo et al., 2022) , adversarial model learning (Rigter et al., 2022) , forming penalties using model ensembles (Yu et al., 2020) , or incorporating a combination of model and values (Yu et al., 2021) . On the theoretical side, as discussed in the introduction, the majority of works design pessimistic offline RL algorithms that rely on some form of uncertainty quantification (Yin & Wang, 2021; Uehara et al., 2021; Zhang et al., 2022; Yan et al., 2022; Yin et al., 2022; Kumar et al., 2021; Shi & Chi, 2022; Wang et al., 2022) 2021) presents a pessimistic modelfree algorithm under a variant of single-policy concentrability framework that requires a bounded ratio of average Bellman error and Bellman completeness. While the original version of the algorithm achieves the optimal 1/ √ N rate, it is computationally intractable. 
A practical version of the algorithm is presented and has a suboptimal 1/N 1/5 guarantee. Another related work by Chen & Jiang (2022) studies MIS combined with value function approximation under π ⋆ -concentrability and proves a 1/ gap(Q ⋆ )N rate, yet the guarantee degrades with Q ⋆ gap and the algorithm is computationally intractable. Cheng et al. (2022) propose an adversarially trained actor-critic method that enjoys provable 1/N 1/3 rate under the single-policy concentrability definition of Xie et al. (2021) and Bellman completeness and performs well in offline RL benchmarks when combined with deep neural networks. 

B.3 OTHER TOPICS

Apart from RL, our work on bandits is related to the selection problem (Hong et al., 2021), though the majority of works in this area are in the online setting. Additionally, in our analysis, we solve a subset of stochastic optimization problems with a possibly large or infinite number of stochastic constraints involving conditional expectations. To our knowledge, finite-sample properties of such stochastic optimization problems have not been addressed (Shapiro et al., 2021), and our work may open up avenues for further research in this area.

C SUPPLEMENTARY MATERIALS FOR MULTI-ARMED BANDITS

We start by presenting pseudocode for the behavior-regularized MIS algorithm in Appendix C.1. In Appendix C.2, we characterize the bias caused by adding the behavior regularization to the primal-dual objective (2). In Appendix C.3, we prove Proposition 1, which demonstrates the failure of unregularized MIS for solving offline MABs, even when the optimal solutions are realizable and an optimal policy is covered in the offline data. Appendix C.4 is devoted to the proof of Theorem 1, which gives a tight performance upper bound for the PRO-MAB algorithm. Finally, in Appendix C.5, we prove Proposition 2, showing that constraint satisfaction is sufficient for the success of unregularized MIS.

C.1 PRIMAL-DUAL REGULARIZED OFFLINE MULTI-ARMED BANDITS (PRO-MAB)

Algorithm 5 Primal-dual Regularized Offline Multi-Armed Bandits (PRO-MAB)
1: Inputs: Dataset $D = \{(a_i, r_i)\}_{i=1}^N$, classes $V = [-B_v, B_v]$ and $W$, function $f(\cdot)$, parameter $\alpha$.
2: Find a solution $(\hat w, \hat v)$ to the following problem:
   $\max_{w \in W} \min_{v \in V} \hat L^{MAB}_\alpha(w, v) := \frac{1}{N} \sum_{i=1}^N \big[ w(a_i) r_i - \alpha f(w(a_i)) - v\,(w(a_i) - 1) \big]$. (16)
3: Return: $\hat\pi(a) = \frac{\hat w(a)\mu(a)}{\sum_{a'} \hat w(a')\mu(a')}$ if $\sum_{a'} \hat w(a')\mu(a') > 0$, and $\hat\pi(a) = \frac{1}{|A|}$ otherwise.
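The steps of Algorithm 5 can be sketched for a finite weight class as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `empirical_objective` and `pro_mab` are hypothetical, arms are indexed from 0, $W$ is a finite list of weight vectors, $V$ is passed as a finite list of candidate multipliers (because the objective is linear in $v$, the two endpoints $\pm B_v$ suffice for the inner minimum over $[-B_v, B_v]$), and the behavior distribution $\mu$ is assumed known.

```python
from typing import Callable, Sequence, Tuple

import numpy as np


def empirical_objective(w: np.ndarray, v: float, actions: np.ndarray,
                        rewards: np.ndarray, alpha: float,
                        f: Callable[[np.ndarray], np.ndarray]) -> float:
    # (1/N) * sum_i [ w(a_i) r_i - alpha f(w(a_i)) - v (w(a_i) - 1) ]
    w_i = w[actions]
    return float(np.mean(w_i * rewards - alpha * f(w_i) - v * (w_i - 1.0)))


def pro_mab(actions: np.ndarray, rewards: np.ndarray, mu: np.ndarray,
            W: Sequence[np.ndarray], V: Sequence[float], alpha: float,
            f: Callable[[np.ndarray], np.ndarray]) -> Tuple[np.ndarray, np.ndarray]:
    # Step 2: max over w in W of the min over v in V of the empirical objective.
    best_w, best_val = None, -np.inf
    for w in W:
        inner = min(empirical_objective(w, v, actions, rewards, alpha, f) for v in V)
        if inner > best_val:
            best_w, best_val = w, inner
    # Step 3: normalize w_hat * mu into a policy, or fall back to uniform.
    d_hat = float(np.dot(best_w, mu))
    if d_hat > 0:
        policy = best_w * mu / d_hat
    else:
        policy = np.full(len(mu), 1.0 / len(mu))
    return best_w, policy
```

For example, with two arms, `mu = np.array([0.5, 0.5])`, and `W = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]`, a dataset in which only arm 0 pays reward makes the sketch return the weight concentrated on arm 0.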

C.2 SOLUTIONS TO THE PRIMAL-DUAL REGULARIZED OBJECTIVE

In the following lemma, we characterize the optimal solution $(w^\star_\alpha, v^\star_\alpha)$ to the behavior-regularized population objective (2) as well as the suboptimality of the policy induced by $w^\star_\alpha$.

Lemma 1 (Regularized primal-dual solutions, MAB) Let $f$ be differentiable, strictly convex, non-negative, and bounded by $B_f$. Denote $r^\star := \max_{a \in A} r(a)$. Then, the following statements hold: (I) $w^\star_0 = w^\star$, where $w^\star$ is the importance weight corresponding to an optimal policy; (II) $v^\star_\alpha = r^\star - c\alpha$, where $0 \le c \le f'(C^\star)$; (III) the policy $\pi^\star_\alpha := \pi_{w^\star_\alpha}$ satisfies $J(\pi^\star) - J(\pi^\star_\alpha) \le \alpha B_f$.

Proof. Part (I) follows directly by strong duality. For part (II), notice that the KKT conditions imply the following relation between $w^\star_\alpha(a)$ and $v^\star_\alpha$:

$w^\star_\alpha(a) = \max\left\{0, \; (f')^{-1}\left(\frac{r(a) - v^\star_\alpha}{\alpha}\right)\right\}$.

Since $f$ is strictly convex, $f'$ is a strictly increasing function. Therefore, the optimal arm $a^\star$ has the largest $w^\star_\alpha(a)$, which should be nonzero due to realizability of $w^\star_\alpha$. In other words,

$w^\star_\alpha(a^\star) = (f')^{-1}\left(\frac{r^\star - v^\star_\alpha}{\alpha}\right) \;\Rightarrow\; v^\star_\alpha = r^\star - \alpha f'(w^\star_\alpha(a^\star))$. (17)

We now proceed to find a bound on $f'(w^\star_\alpha(a^\star))$. Since $w^\star_\alpha$ is the optimal solution to (2), it must satisfy the constraint

$\sum_{a \in A} \mu(a) w^\star_\alpha(a) = 1 \;\Rightarrow\; w^\star_\alpha(a^\star) \le \frac{1}{\mu(a^\star)} \le C^\star$,

where the last inequality stems from the single-policy concentrability assumption on $\pi^\star$. Since $f'$ is an increasing function, we have $f'(w^\star_\alpha(a^\star)) \le f'(C^\star)$, which combined with (17) yields the lower bound $v^\star_\alpha \ge r^\star - \alpha f'(C^\star)$. Moreover, the convexity of $f$ immediately gives the upper bound $v^\star_\alpha \le r^\star$, which completes the proof of part (II).

We now prove the last part. Since $w^\star_\alpha$ is the optimal solution to the regularized population objective (2), by strong duality, we have

$\mathbb{E}_{a \sim d_{w^\star_\alpha}}[r(a)] - \alpha \mathbb{E}_{a \sim \mu}[f(w^\star_\alpha(a))] \ge \mathbb{E}_{a \sim d^\star}[r(a)] - \alpha \mathbb{E}_{a \sim \mu}[f(w^\star(a))]$,

where $d_{w^\star_\alpha}(a) = \mu(a) w^\star_\alpha(a)$ by the definition given in Section 2 and we used the fact that $\mathbb{E}_{a \sim \mu}[w^\star_\alpha(a)] - 1 = \mathbb{E}_{a \sim \mu}[w^\star(a)] - 1 = 0$. Therefore, the suboptimality of $\pi^\star_\alpha$ can be bounded as follows:

$J(\pi^\star) - J(\pi^\star_\alpha) = \mathbb{E}_{a \sim d^\star}[r(a)] - \mathbb{E}_{a \sim d_{w^\star_\alpha}}[r(a)] \le \alpha \mathbb{E}_{a \sim \mu}[f(w^\star(a))] - \alpha \mathbb{E}_{a \sim \mu}[f(w^\star_\alpha(a))] \le \alpha \mathbb{E}_{a \sim \mu}[f(w^\star(a))] \le \alpha f(C^\star) \le \alpha B_f$,

where the second inequality uses the non-negativity of $f$ and the last inequality uses the boundedness of $f$. □
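Lemma 1 can be illustrated numerically on a small instance. The sketch below is a hedged example, not part of the paper: it assumes the specific regularizer $f(x) = x^2/2$ (so $(f')^{-1}(y) = y$, and $f$ is bounded on the relevant range $[0, C^\star]$), a hypothetical three-armed instance, and a hypothetical helper `solve_regularized_mab` that finds $v^\star_\alpha$ by bisection on the constraint $\sum_a \mu(a) w^\star_\alpha(a) = 1$ using the KKT relation from the proof.

```python
import numpy as np


def solve_regularized_mab(r, mu, alpha, iters=200):
    """Solve the KKT system for f(x) = x^2 / 2 by bisection on the multiplier v.

    KKT: w(a) = max(0, (r(a) - v) / alpha), with sum_a mu(a) w(a) = 1.
    """
    r = np.asarray(r, dtype=float)
    mu = np.asarray(mu, dtype=float)
    a_star = int(np.argmax(r))

    def weights(v):
        return np.maximum(0.0, (r - v) / alpha)

    lo = r[a_star] - alpha / mu[a_star]  # here sum_a mu(a) w(a) >= 1
    hi = r[a_star]                       # here sum_a mu(a) w(a) = 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.dot(mu, weights(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    v = 0.5 * (lo + hi)
    return v, weights(v)


# Hypothetical instance: three arms, arm 0 optimal.
r = np.array([1.0, 0.8, 0.2])
mu = np.array([0.2, 0.5, 0.3])
alpha = 0.1

v, w = solve_regularized_mab(r, mu, alpha)
a_star = int(np.argmax(r))
c = (r[a_star] - v) / alpha                      # part (II): v = r* - c * alpha
subopt = r[a_star] - float(np.dot(mu * w, r))    # J(pi*) - J(pi*_alpha)
bound = alpha * mu[a_star] * 0.5 / mu[a_star] ** 2  # alpha * E_mu[f(w*)]
```

On this instance the multiplier solves $2(1 - v) + 5(0.8 - v) = 1$, so $v^\star_\alpha = 5/7$, the constant $c = w^\star_\alpha(a^\star)$ lies in $[0, 1/\mu(a^\star)]$, and the suboptimality stays below $\alpha\,\mathbb{E}_\mu[f(w^\star)]$, in line with parts (II) and (III).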

C.3 PROOF OF PROPOSITION 1

Consider a 2-armed bandit instance with the following reward distributions, data distribution, and function classes.

• Reward distributions: The first arm is optimal with a deterministic reward and the second arm has a Bernoulli reward: $r(1) = 1/2$ w.p. 1, $r(2) \sim \mathrm{Bernoulli}(1/3)$.

• Data distribution: We consider a scenario where most data are concentrated on the optimal arm: $\mu(1) = 1 - 2/N$, $\mu(2) = 2/N$. Here, the single-policy concentrability coefficient is $C^\star = 1/\mu(1)$, which is finite for $N > 2$.

Let $N(a)$ denote the number of samples on arm $a$. To obtain upper and lower bounds on $N(a)$, we resort to the following lemma, which is a direct consequence of the Chernoff bound for binomial variables.

Lemma 2 (Chernoff bounds, binomial) (I) With probability at least $1 - \exp(-N\mu(a)\delta_u^2/(2 + \delta_u))$, one has $N(a) \le (1 + \delta_u) N \mu(a)$ for any $\delta_u > 0$; (II) With probability at least $1 - \exp(-N\mu(a)\delta_l^2/2)$, one has $(1 - \delta_l) N \mu(a) \le N(a)$ for any $0 < \delta_l < 1$.

We condition on the event that the number of samples on the second arm is between 1 and 5, which occurs with probability at least $1 - \exp\left(-\frac{2 \cdot 0.9^2}{2}\right) - \exp\left(-\frac{2 \cdot 1.95^2}{1.95 + 2}\right) \ge 0.4$ by Lemma 2 with $\delta_l = 0.9$ and $\delta_u = 1.95$:

$(1 - 0.9) \cdot N\mu(2) \le N(2) \le (1 + 1.95) \cdot N\mu(2) \;\Rightarrow\; 1 \le N(2) \le 5$.

• Function classes: Assume that $W = \{w_1 = (C^\star, 0), \; w_2 = (0, B_w)\}$ and $V = \{1/2\}$. By Lemma 1, we have $v^\star_0 = r^\star = 1/2$. Therefore, the problem is realizable as $v^\star_0 \in V$ and $w^\star_0 = w^\star = (C^\star, 0) \in W$. Furthermore, notice that for the second candidate $w_2 = (0, B_w) \in W$, the normalization factor is small for a constant $B_w$, as $d_{w_2} = \sum_a w_2(a)\mu(a) = 2B_w/N$.

Consider the case where all $N(2)$ samples on the second arm observe a reward of 1, which happens with probability at least $(1/3)^5$ as we conditioned on the event that $1 \le N(2) \le 5$. We now compute $\hat w$ by solving the empirical objective (16) with $\alpha = 0$. Note that since $|V| = 1$, it suffices to compute $\hat w = \arg\max_{w \in W} \hat L^{MAB}_0(w, v = 1/2)$. We have

$\hat L^{MAB}_0(w_1, 1/2) = \frac{N(1)}{N}\left(C^\star \cdot \frac{1}{2} - \frac{1}{2}(C^\star - 1)\right) + \frac{N(2)}{2N} = \frac{1}{2}$,

$\hat L^{MAB}_0(w_2, 1/2) = \frac{N(1)}{2N} + \frac{N(2)}{N}\left(B_w - \frac{1}{2}(B_w - 1)\right) = \frac{1}{2} + \frac{N(2) B_w}{2N}$.

Since we conditioned on the event with $N(2) \ge 1$, solving the optimization problem $\max_{w \in W} \hat L^{MAB}_0(w, v = 1/2)$ finds $\hat w = (0, B_w)$, leading to a policy that picks the second arm with probability one. Therefore, with constant probability of at least $0.4 \times (1/3)^5 > 0.001$, we have $J(\pi^\star) - J(\hat\pi) = \frac{1}{2} - \frac{1}{3} = \frac{1}{6}$.
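The two empirical objective values in the counterexample can be reproduced directly. The sketch below is illustrative: the helper names are hypothetical, arms are indexed from 0 (arm 0 is the optimal arm), and the dataset is constructed to match the conditioned event, with every sample of the second arm returning reward 1. The concrete values $N = 1000$, $N(2) = 3$, and $B_w = 4$ are arbitrary choices for illustration.

```python
import numpy as np


def empirical_objective_unregularized(w, v, actions, rewards):
    # Empirical objective (16) with alpha = 0: mean of w(a_i) r_i - v (w(a_i) - 1).
    w_i = np.asarray(w, dtype=float)[actions]
    return float(np.mean(w_i * rewards - v * (w_i - 1.0)))


def bad_event_dataset(N, n2):
    # N - n2 pulls of arm 0 (reward 1/2) and n2 pulls of arm 1 that all paid reward 1.
    actions = np.array([0] * (N - n2) + [1] * n2)
    rewards = np.array([0.5] * (N - n2) + [1.0] * n2)
    return actions, rewards


N, n2, B_w = 1000, 3, 4.0
C_star = 1.0 / (1.0 - 2.0 / N)   # 1 / mu(1)
w1 = [C_star, 0.0]               # realizable optimal weight
w2 = [0.0, B_w]                  # spurious candidate with tiny normalization 2 B_w / N

actions, rewards = bad_event_dataset(N, n2)
L1 = empirical_objective_unregularized(w1, 0.5, actions, rewards)
L2 = empirical_objective_unregularized(w2, 0.5, actions, rewards)
```

The gap `L2 - L1` equals $N(2) B_w / (2N)$, which is positive whenever at least one lucky sample of the second arm is present, so the unregularized maximizer selects `w2`.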

C.4 PROOF OF THEOREM 1

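Before embarking on the main proof, we present two lemmas related to the primal-dual regularized approach. The first lemma shows the closeness of the population objective (2) to its empirical approximation used in Algorithm 5, which is a direct consequence of Hoeffding's inequality. We also show that closeness of the objectives results in the closeness of $w^\star_\alpha$ and $\hat w$, which are respectively the optima of (2) and (16). The proof of this lemma is deferred to the end of this subsection.

Lemma 3 (Empirical and population closeness, PRO-MAB) Fix $\delta > 0$ and define

$\epsilon^{MAB}_{stat,\alpha} := \big((B_w + 1)(B_v + 1) + \alpha B_f\big) \sqrt{\frac{\log(|V||W|/\delta)}{N}}$. (18)

For any $w \in W$ and $v \in V$, the following bounds hold with probability at least $1 - \delta$: (I) $|L^{MAB}_\alpha(w, v) - \hat L^{MAB}_\alpha(w, v)| \le \epsilon^{MAB}_{stat,\alpha}$; (II) $L^{MAB}_\alpha(w^\star_\alpha, v) - L^{MAB}_\alpha(\hat w, v) \le 2\epsilon^{MAB}_{stat,\alpha}$.

The second lemma finds a lower bound on the occupancy normalization factor $d_{\hat w} = \sum_a \hat w(a)\mu(a)$ enforced by the behavior regularization.

Lemma 4 (Occupancy validity enforced by behavior regularization) Let $f$ be an $M_f$-strongly-convex function and fix $\delta > 0$. Then, with probability at least $1 - \delta$, one has

$d_{\hat w} \ge 1 - \sqrt{\frac{4\epsilon^{MAB}_{stat,\alpha}}{\alpha M_f}}$, (19)

where $\epsilon^{MAB}_{stat,\alpha}$ is defined in (18).

For the rest of this proof, we condition on the high-probability events of Lemmas 3 and 4. Define $\epsilon_{\hat w, r} := \sum_a \big[w^\star_\alpha(a)\mu(a)r(a) - \hat w(a)\mu(a)r(a)\big]$. By part (II) of Lemma 3, we have $L^{MAB}_\alpha(w^\star_\alpha, v^\star_\alpha) - L^{MAB}_\alpha(\hat w, v^\star_\alpha) \le 2\epsilon^{MAB}_{stat,\alpha}$. Therefore,

$\epsilon_{\hat w, r} - \alpha \mathbb{E}_\mu[f(w^\star_\alpha(a)) - f(\hat w(a))] + v^\star_\alpha (d_{\hat w} - 1) \le 2\epsilon^{MAB}_{stat,\alpha}$. (20)

Recall from Lemma 1 that $v^\star_\alpha = r^\star - \alpha c$, where $c \le f'(C^\star)$. Thus, combined with (20), we write

$\epsilon_{\hat w, r} + r^\star (d_{\hat w} - 1) \le 2\epsilon^{MAB}_{stat,\alpha} + \alpha \mathbb{E}_\mu[f(w^\star_\alpha(a)) - f(\hat w(a))] + \alpha c (d_{\hat w} - 1)$
$\le 2\epsilon^{MAB}_{stat,\alpha} + \alpha\big(2B_f + f'(C^\star) B_w\big)$, (21)

where in the second line we used the bounds $|f(x)| \le B_f$ and $d_{\hat w} \le B_w$. Note that setting $\alpha = 16\epsilon^{MAB}_{stat,1}/M_f$, Lemma 4 asserts that $d_{\hat w} \ge 1/2$. Since $d_{\hat w} \ge 1/2$, the learned policy is written as $\hat\pi(a) = \hat w(a)\mu(a)/d_{\hat w}$. With simple algebraic manipulations, we find the following expression for the suboptimality of $\hat\pi$ with respect to $\pi^\star_\alpha$:

$J(\pi^\star_\alpha) - J(\hat\pi) = \sum_a \Big[w^\star_\alpha(a)\mu(a)r(a) - \frac{1}{d_{\hat w}} \hat w(a)\mu(a)r(a)\Big]$
$= \sum_a \big[w^\star_\alpha(a)\mu(a)r(a) - \hat w(a)\mu(a)r(a)\big] + \sum_a \Big(1 - \frac{1}{d_{\hat w}}\Big) \hat w(a)\mu(a)r(a)$
$= \epsilon_{\hat w, r} + (d_{\hat w} - 1) \sum_a \frac{1}{d_{\hat w}} \hat w(a)\mu(a)r(a)$
$= \epsilon_{\hat w, r} + (d_{\hat w} - 1) J(\hat\pi)$
$= \epsilon_{\hat w, r} + (d_{\hat w} - 1) J(\pi^\star_\alpha) - (d_{\hat w} - 1)\big[J(\pi^\star_\alpha) - J(\hat\pi)\big]$.

Let $\epsilon_{reg} = J(\pi^\star) - J(\pi^\star_\alpha) = r^\star - J(\pi^\star_\alpha)$ denote the suboptimality suffered due to behavior regularization. Rearranging the last display, the suboptimality $J(\pi^\star_\alpha) - J(\hat\pi)$ can be expressed as

$J(\pi^\star_\alpha) - J(\hat\pi) = \frac{1}{d_{\hat w}}\big(\epsilon_{\hat w, r} + (d_{\hat w} - 1) J(\pi^\star_\alpha)\big) = \frac{1}{d_{\hat w}}\big(\epsilon_{\hat w, r} + (d_{\hat w} - 1)(r^\star - \epsilon_{reg})\big) \le \frac{1}{d_{\hat w}}\big(\epsilon_{\hat w, r} + (d_{\hat w} - 1) r^\star\big) - \frac{1}{d_{\hat w}}(d_{\hat w} - 1)\epsilon_{reg}$.

We use the above inequality to bound the suboptimality with respect to an optimal policy:

$J(\pi^\star) - J(\hat\pi) = J(\pi^\star) - J(\pi^\star_\alpha) + J(\pi^\star_\alpha) - J(\hat\pi) = \epsilon_{reg} + J(\pi^\star_\alpha) - J(\hat\pi)$
$\le \epsilon_{reg} + \frac{1}{d_{\hat w}}\big(\epsilon_{\hat w, r} + (d_{\hat w} - 1) r^\star\big) - \frac{1}{d_{\hat w}}(d_{\hat w} - 1)\epsilon_{reg}$
$\le \frac{1}{d_{\hat w}}\big(\epsilon_{\hat w, r} + (d_{\hat w} - 1) r^\star\big) + \frac{1}{d_{\hat w}}\epsilon_{reg}$.

Recall that $1/d_{\hat w} \le 2$ and that $\epsilon_{reg}$ is bounded by $\alpha B_f$ by Lemma 1. Therefore,

$J(\pi^\star) - J(\hat\pi) \le \frac{1}{d_{\hat w}}\big(\epsilon_{\hat w, r} + (d_{\hat w} - 1) r^\star\big) + \frac{1}{d_{\hat w}}\epsilon_{reg} \le 2\big(\epsilon_{\hat w, r} + (d_{\hat w} - 1) r^\star\big) + 2\alpha B_f \le 4\epsilon^{MAB}_{stat,\alpha} + \alpha\big(4B_f + 2f'(C^\star) B_w\big) + 2\alpha B_f \lesssim \alpha\big(B_f + f'(C^\star) B_w\big)$,

where the penultimate inequality relies on the bound derived in (21).

Proof of Lemma 3. $\hat L^{MAB}_\alpha(w, v)$ is an empirical average over independent and bounded random variables, where the bound on the individual variables is computed as

$|w(a_i) r_i - \alpha f(w(a_i)) - v(w(a_i) - 1)| \le B_w + \alpha B_f + B_v(B_w + 1) \le (B_w + 1)(B_v + 1) + \alpha B_f$.

It is easy to see that $\mathbb{E}_D[\hat L^{MAB}_\alpha(w, v)] = L^{MAB}_\alpha(w, v)$, where the expectation is taken with respect to the randomness in the dataset $D$. Part (I) of this lemma is proved by applying Hoeffding's inequality along with a union bound over $w$ and $v$. The proof of part (II) is similar to Lemma 7 of Zhan et al. (2022) and relies on decomposing the objective difference and using the fact that $(\hat w, \hat v)$ is a saddle point of $\hat L^{MAB}_\alpha$. Writing $\hat v_w := \arg\min_{v \in V} \hat L^{MAB}_\alpha(w, v)$, we decompose

$L^{MAB}_\alpha(w^\star_\alpha, v) - L^{MAB}_\alpha(\hat w, v) = T_1 + T_2 + T_3 + T_4 + T_5$,

where
$T_1 := L^{MAB}_\alpha(w^\star_\alpha, v) - L^{MAB}_\alpha(w^\star_\alpha, \hat v_{w^\star_\alpha})$,
$T_2 := L^{MAB}_\alpha(w^\star_\alpha, \hat v_{w^\star_\alpha}) - \hat L^{MAB}_\alpha(w^\star_\alpha, \hat v_{w^\star_\alpha})$,
$T_3 := \hat L^{MAB}_\alpha(w^\star_\alpha, \hat v_{w^\star_\alpha}) - \hat L^{MAB}_\alpha(\hat w, \hat v)$,
$T_4 := \hat L^{MAB}_\alpha(\hat w, \hat v) - \hat L^{MAB}_\alpha(\hat w, v)$,
$T_5 := \hat L^{MAB}_\alpha(\hat w, v) - L^{MAB}_\alpha(\hat w, v)$.

Each term is bounded as follows:
• $T_1 = 0$ because $w^\star_\alpha$ satisfies the constraint $\sum_a w^\star_\alpha(a)\mu(a) = 1$, so for any $v_1, v_2$ we have $L^{MAB}_\alpha(w^\star_\alpha, v_1) = L^{MAB}_\alpha(w^\star_\alpha, v_2)$.
• $T_2 \le \epsilon^{MAB}_{stat,\alpha}$ by part (I).
• $T_3 \le 0$ because $\hat w = \arg\max_{w \in W} \hat L^{MAB}_\alpha(w, \hat v_w)$.
• $T_4 \le 0$ because $\hat v = \arg\min_{v \in V} \hat L^{MAB}_\alpha(\hat w, v)$.
• $T_5 \le \epsilon^{MAB}_{stat,\alpha}$ by part (I).

Summing up the bounds on each term yields the desired bound. □

Proof of Lemma 4. This lemma is a direct consequence of Lemma 8 in Zhan et al. (2022). For completeness, we present a simplified proof for the multi-armed bandit setting. First, observe that since $f$ is $M_f$-strongly-convex, the function $L^{MAB}_\alpha(w, v^\star_\alpha)$ is $\alpha M_f$-strongly-concave with respect to $w$ and the norm $\|\cdot\|_{2,\mu}$. Furthermore, since $w^\star_\alpha = \arg\max_w L^{MAB}_\alpha(w, v^\star_\alpha)$, we have

$\|\hat w - w^\star_\alpha\|_{2,\mu} \le \sqrt{\frac{2\big(L^{MAB}_\alpha(w^\star_\alpha, v^\star_\alpha) - L^{MAB}_\alpha(\hat w, v^\star_\alpha)\big)}{\alpha M_f}}$.

The above bound, along with the bound $L^{MAB}_\alpha(w^\star_\alpha, v^\star_\alpha) - L^{MAB}_\alpha(\hat w, v^\star_\alpha) \le 2\epsilon^{MAB}_{stat,\alpha}$ shown in Lemma 3, gives the following bound on $|d_{\hat w} - 1|$:

$|d_{\hat w} - 1| = \Big|\sum_a \hat w(a)\mu(a) - \sum_a w^\star_\alpha(a)\mu(a)\Big| \le \|\hat w - w^\star_\alpha\|_{1,\mu} \le \|\hat w - w^\star_\alpha\|_{2,\mu} \le \sqrt{\frac{4\epsilon^{MAB}_{stat,\alpha}}{\alpha M_f}}$,

which completes the proof. □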
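The choice $\alpha = 16\epsilon^{MAB}_{stat,1}/M_f$ in the proof of Theorem 1 can be checked with a few lines of arithmetic. The sketch below is illustrative and rests on two assumptions: the statistical error is read as $\big((B_w+1)(B_v+1) + \alpha B_f\big)\sqrt{\log(|V||W|/\delta)/N}$, and the bound of Lemma 4 is read as $d_{\hat w} \ge 1 - \sqrt{4\epsilon^{MAB}_{stat,\alpha}/(\alpha M_f)}$, which is what the strong-concavity argument in its proof produces. All numeric constants below are hypothetical.

```python
import math


def eps_stat_mab(B_w, B_v, B_f, alpha, size_W, size_V, delta, N):
    # Hoeffding-type statistical error with a union bound over W and V.
    scale = (B_w + 1.0) * (B_v + 1.0) + alpha * B_f
    return scale * math.sqrt(math.log(size_W * size_V / delta) / N)


def normalization_lower_bound(eps, alpha, M_f):
    # Lower bound on d_w_hat from the strong-concavity argument of Lemma 4.
    return 1.0 - math.sqrt(4.0 * eps / (alpha * M_f))


# Hypothetical problem constants.
B_w, B_v, B_f, M_f = 2.0, 1.0, 2.0, 1.0
size_W, size_V, delta, N = 10, 10, 0.1, 10 ** 6

eps_1 = eps_stat_mab(B_w, B_v, B_f, 1.0, size_W, size_V, delta, N)
alpha = 16.0 * eps_1 / M_f
eps_alpha = eps_stat_mab(B_w, B_v, B_f, alpha, size_W, size_V, delta, N)
d_lower = normalization_lower_bound(eps_alpha, alpha, M_f)
```

Whenever $\alpha \le 1$ we have $\epsilon^{MAB}_{stat,\alpha} \le \epsilon^{MAB}_{stat,1}$, so the ratio inside the square root is at most $1/4$ and the lower bound is at least $1/2$, which is the property the proof relies on.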

C.5 PROOF OF PROPOSITION 2

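Consider the difference between the population objective with $\alpha = 0$ at $w^\star_0 = w^\star$ and at $\hat w$, which is bounded by Lemma 3:

$L(w^\star, v^\star) - L(\hat w, v^\star) = \mathbb{E}_{a \sim \mu}[r(a)(w^\star(a) - \hat w(a))] - v^\star\, \mathbb{E}_{a \sim \mu}[w^\star(a) - \hat w(a)] \lesssim \epsilon^{MAB}_{stat,\alpha}$.

When $\hat w$ satisfies the constraint $\mathbb{E}_{a \sim \mu}[\hat w(a)] = 1$, the second term vanishes because $\mathbb{E}_{a \sim \mu}[w^\star(a)] = 1$ as well. Substituting the expression for $\epsilon^{MAB}_{stat,\alpha}$ from (18) with $\alpha = 0$, we obtain

$J(\pi^\star) - J(\hat\pi) = \mathbb{E}_{a \sim \mu}[r(a)(w^\star(a) - \hat w(a))] \lesssim B_w (B_v + 1) \sqrt{\frac{\log(|V||W|/\delta)}{N}}$,

where we used the fact that $B_w \asymp B_w + 1$ since $B_w \ge 1$ due to realizability of $w^\star$.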

D PROOFS FOR CONTEXTUAL BANDITS

This section of the appendix is organized as follows. In Appendix D.1, we present details of the PRO-CB algorithm. Appendix D.2 is devoted to the proof of Proposition 3, which shows that the PRO-CB algorithm fails to achieve the statistically optimal rate of $1/\sqrt{N}$. The proof of the suboptimality upper bound for the conservative offline CB algorithm with ALM (Theorem 3) is presented in Appendix D.3.

D.1 PRIMAL-DUAL REGULARIZED OFFLINE CONTEXTUAL BANDITS (PRO-CB)

Define importance weights $w(s, a) = d(s, a)/\mu(s, a)$ as the ratio of the occupancy to the data distribution. The primal-dual regularized approach (Zhan et al., 2022) solves the following population objective:

$\max_{w \ge 0} \min_{v} L^{CB}_\alpha(w, v) := \mathbb{E}_{s,a \sim \mu}\big[w(s,a) r(s,a) - \alpha f(w(s,a)) - v(s)(w(s,a) - 1)\big]$. (24)

The above optimization problem satisfies strong duality. We let $w^\star_\alpha$ and $v^\star_\alpha$ denote the optimal primal and dual solutions, respectively. Restricting $w$ and $v$ to function classes $W$ and $V$ and solving the empirical version of objective (24) leads to PRO-CB, given in Algorithm 6.

Algorithm 6 Primal-dual Regularized Offline Contextual Bandits (PRO-CB)
1: Inputs: Dataset $D = \{(s_i, a_i, r_i)\}_{i=1}^N$, function classes $W$, $V$, function $f(\cdot)$, parameter $\alpha$.
2: Find a solution $(\hat w, \hat v)$ to the following problem:
   $\max_{w \in W} \min_{v \in V} \hat L^{CB}_\alpha(w, v) := \frac{1}{N} \sum_{i=1}^N \big[ w(s_i, a_i) r_i - \alpha f(w(s_i, a_i)) - v(s_i)(w(s_i, a_i) - 1) \big]$. (25)
3: Return: $\hat\pi = \pi_{\hat w}$.

D.2 PROOF OF PROPOSITION 3

We separate the proof into two cases: $\alpha \ge N^{\beta}$ for some $\beta > -1/2$, and $\alpha \le O(N^{-1/2})$. When $\alpha$ is large, we show that the large bias caused by regularization results in suboptimality of order $\alpha$ even in MABs. When $\alpha$ is small, we construct a two-state CB instance (as the single-state case is indeed successful due to Theorem 1), showing that such a small $\alpha$ does not sufficiently enforce occupancy validity in states with a relatively small but still significant state probability $\rho(s)$.

D.2.1 PROOF FOR LARGE α

If there exists $\beta > -\frac{1}{2}$ such that $\alpha \ge N^{\beta}$, then we consider a simple single-state two-arm contextual bandit (equivalently, multi-armed bandit) instance:

• Reward distribution: Both arms have deterministic rewards and the suboptimal arm has a value gap of $\min\{1, \alpha\}$: $r(1) = 1$ w.p. 1, $r(2) = \max\{0, 1 - \alpha\}$ w.p. 1.

• Data distribution: We construct the data distribution such that both arms have constant probability, which implies a constant concentrability ratio $C^\star$. Here we assume $M_f < 100$ for convenience, but if $M_f$ is larger we can use the same construction with an even larger constant as the denominator: $\mu(1) = \frac{M_f}{100}$, $\mu(2) = 1 - \frac{M_f}{100}$.

• Function classes: We assume both $W$ and $V$ contain only the optimal regularized solutions $(w^\star_\alpha, v^\star_\alpha)$ and the optimal unregularized solutions $(w^\star, v^\star)$, which satisfy the realizability requirements of PRO-CB: $W = \{w^\star_\alpha, w^\star\}$, $V = \{v^\star_\alpha, v^\star\}$.

Our argument is broken down into two steps. In the first step, we show that the suboptimality of the optimal regularized policy, which is the policy induced by the regularized optimal weights, $\pi^\star_\alpha := \pi_{w^\star_\alpha}$, is at least of order $\min\{1, \alpha\}$. Then, in the second step, we prove that $w^\star_\alpha$ is chosen with constant probability.

Step 1: Suboptimality of $\pi^\star_\alpha$. In the particular offline bandit instance above, we show the following lower bound on the suboptimality of $\pi^\star_\alpha$:

$J(\pi^\star) - J(\pi^\star_\alpha) = \pi^\star_\alpha(2) \cdot (r(1) - r(2)) = \mu(2) w^\star_\alpha(2) \cdot \min\{1, \alpha\} = \Omega(\min\{1, \alpha\})$. (26)

To establish (26), we show that $w^\star_\alpha(2) > c$ for the fixed constant $c = \frac{1}{2}$. We prove this by contradiction. Suppose

$w^\star_\alpha(2) \le c$. (27)

By the KKT conditions we have

$w^\star_\alpha(2) = \max\left\{0, (f')^{-1}\left(\frac{r(2) - v^\star_\alpha}{\alpha}\right)\right\} \ge (f')^{-1}\left(\frac{r(2) - v^\star_\alpha}{\alpha}\right)$.

Therefore, using the fact that $f'$ is strictly increasing since $f$ is strictly convex, we lower bound $v^\star_\alpha$ according to

$v^\star_\alpha \ge r(2) - \alpha f'(w^\star_\alpha(2)) \ge r(1) - (r(1) - r(2)) - \alpha f'(c)$.

Combining the above bound on $v^\star_\alpha$ with the KKT condition on $w^\star_\alpha(1)$, we then obtain

$w^\star_\alpha(1) = (f')^{-1}\left(\frac{r(1) - v^\star_\alpha}{\alpha}\right) \le (f')^{-1}\left(\frac{r(1) - r(2)}{\alpha} + f'(c)\right)$. (28)

Here, we used the fact that $v^\star_\alpha \le r^\star = r(1)$ and that $f(0) = 0$, so $(f')^{-1}((r(1) - v^\star_\alpha)/\alpha) \ge 0$. Moreover, since the regularization function $f$ is $M_f$-strongly convex, we write

$f'\left(\frac{1 - c\mu(2)}{\mu(1)}\right) - f'(c) \ge M_f\left(\frac{1 - c\mu(2)}{\mu(1)} - c\right) = M_f \frac{1 - c}{\mu(1)} = 100(1 - c) > 1$
$\Rightarrow\; \frac{r(1) - r(2)}{\alpha} + f'(c) \le 1 + f'(c) < f'\left(\frac{1 - c\mu(2)}{\mu(1)}\right)$. (29)

Therefore, we can continue to upper bound the RHS of (28):

$w^\star_\alpha(1) \le (f')^{-1}\left(\frac{r(1) - r(2)}{\alpha} + f'(c)\right) < (f')^{-1}\left(f'\left(\frac{1 - c\mu(2)}{\mu(1)}\right)\right) = \frac{1 - c\mu(2)}{\mu(1)}$,

where the strict inequality is by (29). This further implies that

$w^\star_\alpha(1)\mu(1) < 1 - c\mu(2) \le 1 - w^\star_\alpha(2)\mu(2) \;\Rightarrow\; \sum_a w^\star_\alpha(a)\mu(a) < 1$, (30)

where the second inequality is by (27). Note that (30) contradicts the fact that $(w^\star_\alpha, v^\star_\alpha)$ is the optimal min-max solution of $L^{MAB}_\alpha$, because it violates the constraint $\mathbb{E}_\mu[w(a)] = 1$. Therefore, (27) cannot hold, and we must have

$J(\pi^\star) - J(\pi^\star_\alpha) = \mu(2) w^\star_\alpha(2) \cdot (r(1) - r(2)) > c\left(1 - \frac{M_f}{100}\right) \min\{1, \alpha\} \gtrsim \min\{1, \alpha\}$. (31)

Step 2: $w^\star_\alpha$ is picked with large probability. We now show that $w^\star_\alpha$ is picked by the algorithm with at least a constant probability. Note that since $w^\star_\alpha$ and $w^\star$ both satisfy the constraint $\mathbb{E}_\mu[w] - 1 = 0$, the objectives $L^{MAB}_\alpha(w^\star_\alpha, v)$ and $L^{MAB}_\alpha(w^\star, v)$ do not depend on the Lagrange multiplier $v$. We argue that at the population level, we have the following lower bound on the gap: $L^{MAB}_\alpha(w^\star_\alpha, v) - L^{MAB}_\alpha(w^\star, v) \gtrsim \alpha$. Using the definition of $L^{MAB}_\alpha$, one has

$L^{MAB}_\alpha(w^\star_\alpha, \cdot) - L^{MAB}_\alpha(w^\star, \cdot)$
$= \alpha \mathbb{E}_\mu[f(w^\star(a)) - f(w^\star_\alpha(a))] - \mu(2) w^\star_\alpha(2)(r(1) - r(2))$
$= \alpha\Big\{\mu(1)\big[f(w^\star(1)) - f(w^\star_\alpha(1))\big] + \mu(2)\big[f(w^\star(2)) - f(w^\star_\alpha(2))\big]\Big\} - \mu(2) w^\star_\alpha(2)(r(1) - r(2))$
$\ge \alpha\Big\{\mu(1)\big(w^\star(1) - w^\star_\alpha(1)\big) \cdot f'(w^\star_\alpha(1)) - \mu(2) f(w^\star_\alpha(2))\Big\} - \mu(2) w^\star_\alpha(2)(r(1) - r(2))$ (32)
$= \alpha\mu(2)\Big\{f'(w^\star_\alpha(1)) \cdot w^\star_\alpha(2) - f(w^\star_\alpha(2)) - w^\star_\alpha(2) \cdot \frac{r(1) - r(2)}{\alpha}\Big\}$. (33)

In (32), we used the convexity of the regularization function $f$ as well as the fact that $f(w^\star(2)) = f(0) = 0$. Moreover, (33) holds because

$\mu(1)\big(w^\star(1) - w^\star_\alpha(1)\big) = \mu(1)\left(\frac{1}{\mu(1)} - w^\star_\alpha(1)\right) = 1 - \mu(1) w^\star_\alpha(1) = \mu(2) w^\star_\alpha(2)$.

By the KKT conditions we also have

$f'(w^\star_\alpha(1)) = \frac{r(1) - v^\star_\alpha}{\alpha} = \frac{r(1) - r(2) + \alpha f'(w^\star_\alpha(2))}{\alpha} = \frac{r(1) - r(2)}{\alpha} + f'(w^\star_\alpha(2))$. (34)

Plugging (34) back into (33), we obtain

$L^{MAB}_\alpha(w^\star_\alpha, \cdot) - L^{MAB}_\alpha(w^\star, \cdot) \ge \alpha\mu(2)\Big\{f'(w^\star_\alpha(2)) \cdot w^\star_\alpha(2) - f(w^\star_\alpha(2))\Big\} \ge \alpha\mu(2) \cdot \frac{M_f}{2} w^\star_\alpha(2)^2 > \alpha\mu(2) \cdot \frac{M_f}{2} c^2 \gtrsim \alpha$, (35)

where (35) is based on the fact that $f$ is $M_f$-strongly convex with $f(0) = 0$, and that $w^\star_\alpha(2) > c$ as proved in Step 1.

We now prove that such a large lower bound on the population objective difference leads the algorithm to select $w^\star_\alpha$. Recall from Lemma 3 that with at least constant probability (e.g., setting $\delta = 0.1$), for any $v \in V$, $w \in W$, one has the following bound on the difference between the population and empirical objectives:

$\big|L^{MAB}_\alpha(w, v) - \hat L^{MAB}_\alpha(w, v)\big| \lesssim 2\epsilon^{MAB}_{stat,\alpha}$,

where $\epsilon^{MAB}_{stat,\alpha}$ is of order $1/\sqrt{N}$ as defined in (18). Combining the above inequality with (35), for any $v, v' \in V$ we have

$\hat L^{MAB}_\alpha(w^\star_\alpha, v) - \hat L^{MAB}_\alpha(w^\star, v') \gtrsim \alpha - \epsilon^{MAB}_{stat,\alpha} \gtrsim \alpha - (1 + \alpha) N^{-\frac{1}{2}} \gtrsim N^{\beta} - N^{-\frac{1}{2}}$.

Therefore, since $\beta > -1/2$, we conclude that $w^\star_\alpha$ is chosen by the algorithm with constant probability:

$\min_{v \in V} \hat L^{MAB}_\alpha(w^\star_\alpha, v) - \min_{v \in V} \hat L^{MAB}_\alpha(w^\star, v) > 0 \;\Rightarrow\; w^\star_\alpha = \arg\max_{w \in W} \min_{v \in V} \hat L^{MAB}_\alpha(w, v)$.

Combining the above result with the suboptimality lower bound of $\pi^\star_\alpha$ in (31) completes the proof for $\alpha \ge N^{\beta}$.
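Step 1 of the large-$\alpha$ argument can be made concrete with a specific regularizer. The sketch below is a hedged worked instance, not taken from the paper: it assumes $f(x) = M_f x^2/2$ (so $f'(x) = M_f x$ and $f(0) = 0$), restricts to $0 < \alpha \le 1$, and solves the KKT system of the two-arm instance in closed form with both arms active. The function name `large_alpha_instance` is hypothetical.

```python
def large_alpha_instance(alpha, M_f=1.0):
    """Closed-form regularized solution for the two-arm instance with f(x) = M_f x^2 / 2."""
    assert 0.0 < alpha <= 1.0 and 0.0 < M_f < 100.0
    mu1 = M_f / 100.0
    mu2 = 1.0 - mu1
    r1, r2 = 1.0, max(0.0, 1.0 - alpha)
    # KKT with both arms active: w(a) = (r(a) - v) / (alpha * M_f).
    # The constraint mu1 * w1 + mu2 * w2 = 1 then pins down the multiplier v.
    v = mu1 * r1 + mu2 * r2 - alpha * M_f
    w1 = (r1 - v) / (alpha * M_f)
    w2 = (r2 - v) / (alpha * M_f)
    subopt = mu2 * w2 * (r1 - r2)  # J(pi*) - J(pi*_alpha)
    return {"mu": (mu1, mu2), "v": v, "w": (w1, w2), "subopt": subopt}


result = large_alpha_instance(alpha=0.1)
```

With $M_f = 1$ and $\alpha = 0.1$ this gives $w^\star_\alpha = (1.99, 0.99)$, so $w^\star_\alpha(2) > 1/2$ as claimed, and the suboptimality is $0.99 \times 0.99 \times \alpha \approx 0.098$, which scales linearly in $\alpha$.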

D.2.2 PROOF FOR SMALL α

Now suppose α ≤ O(N -1 2 ), where O hides the logarithmic factors. In this case, we consider the following two-state two-arm contextual bandit instance: • State and reward distributions: We construct the states such that state 1 has a very small probability mass. For state 1, the first arm is optimal with a Bernoulli-distributed reward and the second arm is suboptimal with a deterministic reward. For state 2, both arms have deterministic rewards. Importantly, state 1 has a constant value gap in its suboptimal action. ρ(1) = N -1 4 , r(1, 1) ∼ Bernoulli 1 2 , r(1, 2) ≡ 1 3 ; ρ(2) = 1 -N -1 4 , r(2, 1) ≡ 1 2 , r(2, 2) ≡ 1 3 . • Data distribution: We assume that for both states, most of the probability density is concentrated on the optimal arm. µ(s) = ρ(s), s = 1, 2. µ(1|1) = µ(1|2) = 1 - 2 N , µ(2|1) = µ(2|2) = 2 N . • Function classes: Let w be defined as w(2, a) = w ⋆ α (2, a) and w(1, a) = 0 for a = 1, 2. Consider the following function classes W and V: W = {w ⋆ α , w}, V = {v ⋆ α , v ⋆ }. ( ) The proof is broken down into 4 steps. In the first step, we show that when α ≤ O(N -1 2 ) and N is sufficiently large, the regularized optimal policy is the same as the unregularized optimal policy, i.e., w ⋆ α = w ⋆ . Therefore, the function class W defined in ( 36) is realizable w ⋆ α = w ⋆ ∈ W. In the second step, we prove that with constant probability v ⋆ α = argmin v∈V LCB α ( w, v). Then, we show that solving the saddle point of the empirical objective LCB α (w, v) selects w over w ⋆ α with a constant probability. Finally, we prove that w induces a policy π w that suffers from suboptimality of order N -1/4 , which completes the proof. Step 1: Regularized optimal weights coincides with unregularized optimal weights. 
Since the population optimization problem ( 24) is independent across states at a population level, we can use the result of Lemma 1 to conclude that v ⋆ α (s) = r ⋆ (s) -c(s)α, and w ⋆ α (s, a) = max 0, (f ′ ) -1 r(s, a) -v ⋆ α (s) α = max 0, (f ′ ) -1 c(s) - r ⋆ (s) -r(s, a) α , where 0 ≤ c(s) ≤ f ′ (C ⋆ ) for s ∈ {1, 2}. Since r ⋆ (s) -r(s, 2) = 1 6 = Θ(1), for N ≥ (6f ′ (C ⋆ )) 2 , we have w ⋆ α (s, 2) = 0 for the suboptimal arm 2. Thus w ⋆ α (s) = w ⋆ (s) = 1 µ(1|s) . Correspondingly, we can use the KKT conditions to compute v ⋆ α (s) = r ⋆ (s) -αf ′ 1 µ(1|s) . Step 2: v ⋆ α = argmin v∈V LCB α ( w, v) with constant probability. Let μ denote the empirical statearm distribution and r denote the empirical mean reward. Define the following event: E := a μ(a|s) w(s, a) ≤ 1 for s ∈ {1, 2} . Recall that we defined w(1, a) = 0 and w(2, a) = w ⋆ (2, a). Thus, the above event can be equivalently written as a μ(a|2)w ⋆ α (2, a) ≤ 1 ⇐⇒ a (μ(a|2) -µ(a|2)) w ⋆ α (2, a) ≤ 0. Here we used the fact that a µ(a|2)w ⋆ α (s, 2) = 1. Moreover, in Step 1 we showed that w ⋆ α = w ⋆ , thus w ⋆ α (2, 2) = 0 and (38) corresponds to the following event E = {μ(1|2) -µ(1|2) ≤ 0} . Since μ(1|2) is an empirical version of the conditional probability µ(1|2), event E happens with probability 1 2 . We condition on the event E for the rest of the proof. Using the fact that v ⋆ α (s) ≤ r ⋆ (s) = v ⋆ (s), we conclude that LCB α ( w, v ⋆ α ) ≤ LCB α ( w, v ⋆ ) ⇒ LCB α ( w, v ⋆ α ) = min v∈V LCB α ( w, v). ( ) Step 3: Analyzing the probability of picking w ⋆ α . Now we compare the value of LCB α (•, v ⋆ α ) evaluated at w and w ⋆ α . 
We use the definition w(2, a) = w ⋆ α (2, a) and write LCB α (w ⋆ α , v ⋆ α ) -LCB α ( w, v ⋆ α ) =μ(1) r(1, 1)μ(1|1)w ⋆ α (1, 1) + αμ(1|1)f (w ⋆ α (1, 1)) + v ⋆ α (1) a μ(a|1) (w(1, a) -w ⋆ α (1, a)) Noting that v ⋆ α (1) = r(1, 1) -αf ′ (w ⋆ α (s 1 , a 1 )), w(1, a) = 0, and w ⋆ α (1, 1) = w ⋆ (1, 1) = 1 µ(1|1) , we further simplify the above equation LCB α (w ⋆ α , v ⋆ α ) -LCB α ( w, v ⋆ α ) =μ(1) r(1, 1) -r(1, 1) + αf ′ 1 µ(1|1) μ(1|1) µ(1|1) + αμ(1|1)f 1 µ(1|1) (41) =μ(1, 1) r(1, 1) -r(1, 1) µ(1|1) + α • 1 µ(1|1) f ′ 1 µ(1|1) + f 1 µ(1|1) . We then prove that with constant probability, the first term in ( 42) is negative with a magnitude larger than the second term: r(1, 1) -r(1, 1) µ(1|1) ≲ -N -3/8 . ( ) The proof of this inequality relies on anti-concentration bounds of binomial random variables and is presented at the end of this section. By Inequality (43) combined with (40), we conclude that min v∈V LCB α ( w, v) = LCB α ( w, v ⋆ α ) > LCB α (w ⋆ α , v ⋆ α ) ≥ min v∈V LCB α (w ⋆ α , v), which guarantees that the algorithm picks w with a constant probability. Step 4: Suboptimality of π w Finally, for the policy π w induced by w, we have J(π ⋆ α ) -J(π w ) = µ(s 1 )π w (2|1)(r(1, 1) -r(1, 2)) = N -1 4 12 ≥ Ω(N β ), for β = -1 4 > -1 2 , as desired. The proof for small α is thus complete. Proof of Inequality (43). Using the Chernoff bounds for binomial random variables given in Lemma 2 (adapted from Proposition 7.3.2 of Matoušek & Vondrák (2001) ), one can conclude that the following event E ′ happens with probability at least 0.5: E ′ := N (1, 1) ≥ 0.1N µ(1, 1) ≥ 0.05N 3 4 . ( ) Furthermore, E and E ′ are independent because the random variable r(s 1 , a 1 ) is independent from the arm distribution within state s 2 . 
Therefore, conditioning on E ∩ E ′ which happens with probability 0.5 × 0.5 = 0.25, we use the anti-concentration bounds for Binomial random variables Lemma 5 to obtain the following lower bound: Pr r(1, 1) -r(1, 1) ≤ - log(2c 1 ) c 2 N (1, 1) ≤ -c ′ N -3 8 E ∩ E ′ ≥ 0.5, where c ′ = 20 log(2c1) c2 is a universal constant. Therefore, we have established that (43) holds with constant probability. Lemma 5 (Anti-concentration of Binomial random variables) Let X 1 , • • • , X n be independent random variables following the Bernoulli distribution with mean 1 2 , and let X = 1 n n i=1 X i be the empirical mean. Then we have that for any t ∈ [0, 1 8 ] and universal constants c 1 , c 2 , Pr X ≤ E[X] -t ≥ c 1 e -c2t 2 n .

D.3 PROOF OF THEOREM 3

The proof of this theorem largely follows the same steps as the proof of Theorem 1. We start by presenting two lemmas. The first lemma leverages Hoeffding's inequality to establish the closeness of the population objective (7) and the empirical objective (8). Additionally, we show that this result implies closeness of the population objective at $w^\star$ and $\hat w$. The proof of this lemma is presented at the end of this subsection.

Lemma 6 (Empirical and population closeness, CB) Fix $\delta > 0$ and define

$$\epsilon^{CB}_{stat} := 3(B_w + 1)^2 (B_v + 1)\sqrt{\frac{\log(|\mathcal{W}||\mathcal{V}|/\delta)}{N}}. \tag{48}$$

For any $w \in \mathcal{W}$ and $v \in \mathcal{V}$, the following statements hold with probability at least $1 - \delta$:

(I) $\left| L^{CB}_{AL}(w, v) - \hat L^{CB}_{AL}(w, v) \right| \le \epsilon^{CB}_{stat}$;

(II) $L^{CB}_{AL}(w^\star, v) - L^{CB}_{AL}(\hat w, v) \le 2\epsilon^{CB}_{stat}$.

In the second lemma, we prove that the ALM term enforces a lower bound on the normalization factors $d^{\hat w}_\mu(s) := \sum_a \hat w(s, a)\mu(a|s)$ for significant states.

Lemma 7 (Occupancy validity enforced by the ALM) Define the state space subset

$$S_s := \left\{ s \;:\; d^{\hat w}_\mu(s) \le \tfrac{1}{2} \right\}. \tag{49}$$

For any fixed $\delta > 0$, the following statements hold with probability at least $1 - \delta$:

(I) $E_{s,a\sim\mu}\left[ (r^\star(s) - r(s, a))\,\hat w(s, a) \right] \lesssim \epsilon^{CB}_{stat}$;

(II) $\sum_{s \in S_s} \mu(s) \lesssim \epsilon^{CB}_{stat}$,

where $\epsilon^{CB}_{stat}$ is defined in (48).

Given the two lemmas above, our suboptimality analysis can be broken down into two simple steps. First, we partition the states based on $S_s$ defined in (49) and decompose the policy suboptimality accordingly:

$$\sum_s \mu(s)V^\star(s) - \sum_s \mu(s)V^{\pi}(s) = \sum_{s \in S_s} \mu(s)\left(V^\star(s) - V^{\pi}(s)\right) + \sum_{s \notin S_s} \mu(s)\left(V^\star(s) - V^{\pi}(s)\right)$$

$$\lesssim \epsilon^{CB}_{stat} + \sum_{s \notin S_s} \mu(s)\left(V^\star(s) - V^{\pi}(s)\right) \tag{50}$$

$$\le \epsilon^{CB}_{stat} + 2\sum_s d_{\hat w}(s)\left(V^\star(s) - V^{\pi}(s)\right). \tag{51}$$

Here $d_{\hat w}(s) := \sum_a \hat w(s, a)\mu(s, a) = \mu(s)\, d^{\hat w}_\mu(s)$. In (50), we used part (II) of Lemma 7 to bound the first term, and (51) uses the fact that, by definition, for all $s \notin S_s$ we have $\mu(s) < 2 d_{\hat w}(s)$ and $V^\star(s) - V^{\pi}(s) \ge 0$.
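Moreover, the second term in (51) is bounded by part (I) of Lemma 7 since

$$\sum_s d_{\hat w}(s)\left(V^\star(s) - V^{\pi}(s)\right) = \sum_{s: d_{\hat w}(s) > 0} d_{\hat w}(s)\left( r^\star(s) - \sum_a \pi(a|s) r(s, a) \right)$$

$$= \sum_{s: d_{\hat w}(s) > 0} \sum_a \left( d_{\hat w}(s, a) r^\star(s) - d_{\hat w}(s)\frac{\hat w(s, a)\mu(s, a)}{d_{\hat w}(s)} r(s, a) \right)$$

$$= \sum_{s: d_{\hat w}(s) > 0} \sum_a \left( \hat w(s, a)\mu(s, a) r^\star(s) - \hat w(s, a)\mu(s, a) r(s, a) \right)$$

$$\le \sum_{s,a} \mu(s, a)\hat w(s, a)\left(r^\star(s) - r(s, a)\right) \lesssim \epsilon^{CB}_{stat},$$

where $d_{\hat w}(s, a) := \hat w(s, a)\mu(s, a)$ and the equations follow from the definition of $\pi$. The final suboptimality bound is proved by noting that $(B_w + 1)^2 \asymp B_w^2$ since $B_w \ge 1$ due to realizability of $w^\star(s, a^\star) \ge 1$.

Proof of Lemma 6. To prove part (I), notice that $E_\mu\left[\hat L^{CB}_{AL}(w, v)\right] = L^{CB}_{AL}(w, v)$. Furthermore, $\hat L^{CB}_{AL}(w, v)$ is an empirical average of i.i.d. random variables which are bounded by

$$\left| w(s, a) r(s, a) - v(s)(w(s, a) - 1) - \left( \sum_a w(s, a)\mu(a|s) - 1 \right)^2 \right| \le B_w + B_v(B_w + 1) + B_w^2 \le 3(B_w + 1)^2 (B_v + 1).$$

Applying Hoeffding's inequality along with a union bound over $w \in \mathcal{W}$ and $v \in \mathcal{V}$ finishes the proof of part (I).

We now prove part (II). For the primal-dual objective without the AL term

$$\max_{w \ge 0} \min_v L^{CB}(w, v) := E_{s,a\sim\mu}[w(s, a) r(s, a)] - E_{s,a\sim\mu}[v(s)(w(s, a) - 1)],$$

we have $(w^\star, v^\star) \in \arg\max_{w \ge 0} \arg\min_v L^{CB}(w, v)$ by strong duality. Moreover, since $w^\star$ is realizable, it satisfies the validity constraint $E_{a\sim\mu(\cdot|s)}[w^\star(s, a)] = 1$ for all $s$. Therefore, by Lemma 14, adding the ALM term does not change the optimal solution and we have $(w^\star, v^\star) \in \arg\max_{w \ge 0} \arg\min_v L^{CB}_{AL}(w, v)$. We follow similar steps as in the proof of Lemma 1 and decompose $L^{CB}_{AL}(w^\star, v) - L^{CB}_{AL}(\hat w, v)$ according to

$$L^{CB}_{AL}(w^\star, v) - L^{CB}_{AL}(\hat w, v) = \underbrace{L^{CB}_{AL}(w^\star, v) - L^{CB}_{AL}(w^\star, \hat v_{w^\star})}_{=: T_1} + \underbrace{L^{CB}_{AL}(w^\star, \hat v_{w^\star}) - \hat L^{CB}_{AL}(w^\star, \hat v_{w^\star})}_{=: T_2}$$

$$+ \underbrace{\hat L^{CB}_{AL}(w^\star, \hat v_{w^\star}) - \hat L^{CB}_{AL}(\hat w, \hat v)}_{=: T_3} + \underbrace{\hat L^{CB}_{AL}(\hat w, \hat v) - \hat L^{CB}_{AL}(\hat w, v)}_{=: T_4} + \underbrace{\hat L^{CB}_{AL}(\hat w, v) - L^{CB}_{AL}(\hat w, v)}_{=: T_5},$$

where $\hat v_w = \arg\min_{v \in \mathcal{V}} \hat L^{CB}_{AL}(w, v)$. Each term is bounded as follows:

• $T_1 = 0$ because $w^\star$ satisfies the optimization constraints.
• $T_2 \le \epsilon^{CB}_{stat}$ due to part (I) of Lemma 6.
• $T_3 \le 0$ because $\hat w = \arg\max_{w \in \mathcal{W}} \hat L^{CB}_{AL}(w, \hat v_w)$.
• $T_4 \le 0$ because $\hat v = \arg\min_{v \in \mathcal{V}} \hat L^{CB}_{AL}(\hat w, v)$.
• $T_5 \le \epsilon^{CB}_{stat}$ due to part (I) of Lemma 6.

Summing up the bounds on each term proves part (II). □

Proof of Lemma 7. We leverage the closeness of the objective at $w^\star$ and $\hat w$ established in Lemma 6 to show that the ALM term at $\hat w$ is small. Since $w^\star$ satisfies the validity constraints, the objective at $w^\star$ simplifies to

$$L^{CB}_{AL}(w^\star, v) = E_{s,a\sim\mu}[r(s, a) w^\star(s, a)] + \underbrace{E_{s,a\sim\mu}[v(s)(1 - w^\star(s, a))]}_{= 0} - \underbrace{E_{s\sim\mu}\left[\left(E_{a\sim\mu(\cdot|s)}[w^\star(s, a)] - 1\right)^2\right]}_{= 0} = E_{s,a\sim\mu}[r(s, a) w^\star(s, a)].$$

Consider the objective difference at $v(s) = r^\star(s) := \max_a r(s, a)$:

$$L^{CB}_{AL}(w^\star, r^\star) - L^{CB}_{AL}(\hat w, r^\star) = \sum_s \mu(s) r^\star(s) - \left( E_{s,a\sim\mu}[\hat w(s, a) r(s, a)] - E_{s,a\sim\mu}[r^\star(s)(\hat w(s, a) - 1)] \right) + E_{s\sim\mu}\left[\left(d^{\hat w}_\mu(s) - 1\right)^2\right]$$

$$= \sum_{s,a} \mu(s, a)\left[r^\star(s) - r(s, a)\right]\hat w(s, a) + E_{s\sim\mu}\left[\left(d^{\hat w}_\mu(s) - 1\right)^2\right].$$

Since $L^{CB}_{AL}(w^\star, v) - L^{CB}_{AL}(\hat w, v) \lesssim \epsilon^{CB}_{stat}$ by Lemma 6, we conclude that

$$\sum_{s,a} \mu(s, a)\left[r^\star(s) - r(s, a)\right]\hat w(s, a) + E_{s\sim\mu}\left[\left(d^{\hat w}_\mu(s) - 1\right)^2\right] \lesssim \epsilon^{CB}_{stat}.$$

Moreover, since the first term is nonnegative due to $\hat w(s, a) \ge 0$ and $r^\star(s) - r(s, a) \ge 0$, and the second term is a square, both terms in the above inequality are bounded by $\epsilon^{CB}_{stat}$, thereby proving part (I). The above result also allows us to bound the mass on the subset $S_s$ that contains the states that violate state occupancy validity:

$$\epsilon^{CB}_{stat} \gtrsim \sum_s \mu(s)\left(d^{\hat w}_\mu(s) - 1\right)^2 \ge \sum_{s \in S_s} \mu(s)\left(d^{\hat w}_\mu(s) - 1\right)^2 \ge \frac{1}{4}\sum_{s \in S_s} \mu(s) \gtrsim \sum_{s \in S_s} \mu(s),$$

where we used the fact that $d^{\hat w}_\mu(s) \le \frac{1}{2}$ and thus $\left(d^{\hat w}_\mu(s) - 1\right)^2 \ge \frac{1}{4}$ by definition of $S_s$. This concludes the proof of part (II). □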
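The two deterministic facts used in this subsection can be checked on a toy contextual bandit. The sketch below is only an illustration under assumed data: the three-state, two-action instance and the function name `cb_checks` are hypothetical, and the weights are arbitrary nonnegative numbers rather than the output of Algorithm 2. It verifies (a) the identity linking the policy suboptimality weighted by $\mu(s)\,d^{\hat w}_\mu(s)$ to $\sum_{s,a}\mu(s,a)\hat w(s,a)(r^\star(s) - r(s,a))$, and (b) the mass bound $\sum_{s \in S_s}\mu(s) \le 4\,E_{s\sim\mu}[(d^{\hat w}_\mu(s) - 1)^2]$ from the proof of Lemma 7.

```python
def cb_checks(mu_s, mu_a, r, w):
    """Return (lhs, rhs, penalty, mass) for a tabular contextual bandit.

    mu_s[s]    : state distribution of the data
    mu_a[s][a] : behavior policy mu(a|s)
    r[s][a]    : mean rewards
    w[s][a]    : nonnegative importance weights (arbitrary, for illustration)
    """
    n_s, n_a = len(mu_s), len(mu_a[0])
    # Normalization factor d(s) = sum_a w(s, a) mu(a|s).
    d = [sum(w[s][a] * mu_a[s][a] for a in range(n_a)) for s in range(n_s)]
    # Policy induced by the weights; uniform where the normalization is zero.
    pi = [[w[s][a] * mu_a[s][a] / d[s] if d[s] > 0 else 1.0 / n_a
           for a in range(n_a)] for s in range(n_s)]
    r_star = [max(r[s]) for s in range(n_s)]
    v_pi = [sum(pi[s][a] * r[s][a] for a in range(n_a)) for s in range(n_s)]
    # (a) occupancy-weighted suboptimality vs. weighted reward gap.
    lhs = sum(mu_s[s] * d[s] * (r_star[s] - v_pi[s]) for s in range(n_s))
    rhs = sum(mu_s[s] * mu_a[s][a] * w[s][a] * (r_star[s] - r[s][a])
              for s in range(n_s) for a in range(n_a))
    # (b) ALM penalty and the mass of states with d(s) <= 1/2.
    penalty = sum(mu_s[s] * (d[s] - 1.0) ** 2 for s in range(n_s))
    mass = sum(mu_s[s] for s in range(n_s) if d[s] <= 0.5)
    return lhs, rhs, penalty, mass


mu_s = [0.5, 0.3, 0.2]
mu_a = [[0.5, 0.5], [0.8, 0.2], [0.1, 0.9]]
r = [[1.0, 0.0], [0.2, 0.7], [0.5, 0.4]]
w = [[2.0, 0.0], [0.5, 0.2], [0.0, 1.0]]
lhs, rhs, penalty, mass = cb_checks(mu_s, mu_a, r, w)
print(lhs, rhs, penalty, mass)
```

In this instance only the second state has normalization factor below $1/2$, so the mass bound is driven entirely by how far that state's factor is from one.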

E PROOFS FOR MDPS

In this section, we begin by introducing some additional notation. The original primal-dual objective without the ALM term is given by

$$\max_{w \ge 0} \min_v L^{MDP}(w, v) := (1 - \gamma)E_{s\sim\rho}[v(s)] + E_{s,a\sim\mu}[w(s, a) e_v(s, a)].$$

Define $w^\star(s, a) = d^{\pi^\star}(s, a)/\mu(s, a)$ and $v^\star(s) = V^\star(s)$. By strong duality, one has $(w^\star, v^\star) \in \arg\max_{w \ge 0} \arg\min_v L^{MDP}(w, v)$. Additionally, define $\zeta^\star_{w,u} = \arg\max_{\zeta < 0} L^{\text{model-free}}_{AL}(w, v, u, \zeta)$ for all $w \in \mathcal{W}, u \in \mathcal{U}$, and $\zeta^\star_w = \zeta^\star_{w, u^\star_w}$ for all $w \in \mathcal{W}$, where $u^\star_w$ is defined in Theorem 4. Also, denote $\zeta^\star = \zeta^\star_{w^\star}$ and $u^\star = u^\star_{w^\star}$.

The rest of this section is organized as follows. In Appendix E.1, we provide some details regarding practical implementation of the offline learning algorithm with ALM. In Appendix E.2, we derive the objective of the model-free ALMIS algorithm. Appendix E.3 contains the proof of the performance upper bound for the model-based and model-free ALMIS algorithms (Theorem 4), which relies on several lemmas subsequently proved in Appendices E.4 through E.7.
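To make the primal-dual objective concrete, the following sketch evaluates $L^{MDP}(w, v)$ on a small tabular MDP. Everything in it is an assumption made for illustration: the two-state, two-action MDP, the data distribution `MU`, and the fixed policy `PI` are hypothetical, and a fixed policy is used in place of $\pi^\star$ because the property being illustrated only needs $w\mu$ to be a valid occupancy. With $w = d^{\pi}/\mu$, the objective equals $J(\pi)$ for every $v$, whereas for weights that violate the Bellman flow constraints it depends on $v$.

```python
GAMMA = 0.9
P = [[[0.7, 0.3], [0.2, 0.8]],   # P[s][a][s_next], hypothetical transitions
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[1.0, 0.0], [0.3, 0.6]]     # mean rewards r(s, a)
RHO = [0.6, 0.4]                 # initial state distribution
PI = [[0.25, 0.75], [0.5, 0.5]]  # a fixed policy pi(a|s)
MU = [[0.1, 0.4], [0.3, 0.2]]    # data distribution mu(s, a)
S, A = range(2), range(2)


def occupancy(pi, iters=2000):
    """Normalized discounted occupancy d(s, a) via fixed-point iteration."""
    d = RHO[:]
    for _ in range(iters):
        d = [(1 - GAMMA) * RHO[s]
             + GAMMA * sum(d[sp] * pi[sp][ap] * P[sp][ap][s]
                           for sp in S for ap in A)
             for s in S]
    return [[d[s] * pi[s][a] for a in A] for s in S]


def objective(w, v):
    """L(w, v) = (1 - gamma) E_rho[v] + E_mu[w * e_v]."""
    total = (1 - GAMMA) * sum(RHO[s] * v[s] for s in S)
    for s in S:
        for a in A:
            e_v = R[s][a] + GAMMA * sum(P[s][a][sp] * v[sp] for sp in S) - v[s]
            total += MU[s][a] * w[s][a] * e_v
    return total


d_pi = occupancy(PI)
J = sum(d_pi[s][a] * R[s][a] for s in S for a in A)  # normalized return
w_valid = [[d_pi[s][a] / MU[s][a] for a in A] for s in S]
w_ones = [[1.0, 1.0], [1.0, 1.0]]  # weights that break the flow constraints
print(J, objective(w_valid, [0.0, 0.0]), objective(w_valid, [3.0, -7.0]))
```

The dependence on $v$ for invalid weights is what the dual variable exploits in the max-min formulation.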

E.1 ON PRACTICAL IMPLEMENTATIONS

In our algorithms for CB and MDP, we need to compute summations of the form $\sum_{a \in \mathcal{A}}$. This can be implemented efficiently when $|\mathcal{A}|$ is small. When $|\mathcal{A}|$ is large or even infinite, one can use numerical methods to estimate the summation to the desired precision. Additionally, in Algorithm 3, we need to evaluate the term $\sum_{s', a'} P(s'|s, a)\pi_w(a'|s')u(s', a')$. In practice, we can evaluate this term by numerical integration.

For MAB, CB, and model-based RL, our algorithms need to solve a max-min(-min) problem. For model-free RL, the max-min-min-max problem can be converted to a max(-max)-min(-min) problem. This is because we can first exchange $\min_u$ and $\max_\zeta$ since $L^{\text{model-free}}_{AL}$ as defined in (54) is convex-concave w.r.t. $(u, \zeta)$. Then, we can exchange $\min_v$ and $\max_\zeta$ since $v$ and $\zeta$ are not coupled in $L^{\text{model-free}}_{AL}$. Therefore, our algorithms only require a max-min oracle, which is also required in prior work on provable conservative offline RL with general function approximators such as Zhan et al. (2022). Moreover, many practically successful offline RL algorithms, such as the DICE family (Nachum et al., 2019b; a; Yang et al., 2020; Lee et al., 2021), also solve minimax problems.

E.2 DERIVATION OF THE MODEL-FREE ALMIS OBJECTIVE

For $f(x) = (x - 1)^2$, the Fenchel conjugate $f_*$ is given by

$$f_*(x) = \max_y \left(xy - f(y)\right) = \max_y \left(xy - y^2 + 2y - 1\right) = \left(\frac{x + 2}{2}\right)^2 - 1.$$

Since $d_w(s)/d^{\pi_w}(s) \ge 0$, we have $x^\star_w(s, a) \ge -2$ and thus it is sufficient to only consider the domain $x(s, a) \ge -2$, over which $f_*(x)$ is invertible. Let $g(x) = -f_*^{-1}(x) = 2 - 2\sqrt{x + 1}$, which is a convex function on $[-1, +\infty)$. Similar to Nachum et al. (2019a), we use Fenchel duality to estimate $g\left(u(s, a) - \gamma P^{\pi_w}u(s, a)\right)$. By Fenchel duality, any convex function $g(x)$ can be written as $g(x) = \max_\zeta x\zeta - g_*(\zeta)$. In the case of $g(x)$, the Fenchel conjugate is given by $g_*(x) = -x - 2 - 1/x$ with domain $x < 0$. Therefore, we write

Bounding the suboptimality of policies returned by both model-based and model-free variants of ALMIS follows a similar analysis. We first characterize the statistical error in approximating population objectives by their empirical versions and use it to establish the closeness of $\hat w$ and $w^\star$. The lemma below captures these approximation errors for the model-based objective; its proof can be found in Appendix E.5.

Lemma 9 (Empirical and population closeness, model-based ALMIS) Fix $\delta > 0$ and define

$$\epsilon^{\text{model-based}}_{stat} := \left(B_u + (1 + B_v)B_w\right)\sqrt{\frac{B_u \log(|\mathcal{P}||\mathcal{U}||\mathcal{W}||\mathcal{V}|/\delta)}{N}}. \tag{55}$$

For any $w \in \mathcal{W}$, $v \in \mathcal{V}$, and $u \in \mathcal{U}$, the following statements hold with probability at least $1 - \delta$:

(I) $\left| L^{\text{model-based}}_{AL}(w, v, u) - \hat L^{\text{model-based}}_{AL}(w, v, u) \right| \le \epsilon^{\text{model-based}}_{stat}$;

(II) $L^{\text{model-based}}_{AL}(w^\star, v^\star, u^\star) - L^{\text{model-based}}_{AL}(\hat w, v^\star, u^\star_{\hat w}) \le 2\epsilon^{\text{model-based}}_{stat}$.

In Appendix E.6, we prove a similar lemma for the model-free objective.

Lemma 10 (Empirical and population closeness, model-free ALMIS) Fix $\delta > 0$ and define

$$\epsilon^{\text{model-free}}_{stat} := \left(B_u + (1 + B_v + B_\zeta(B_u + 1))B_w\right)\sqrt{\frac{\log(|\mathcal{U}||\mathcal{W}||\mathcal{V}||\mathcal{Z}|/\delta)}{N}}. \tag{56}$$

For any $w \in \mathcal{W}$, $v \in \mathcal{V}$, and $u \in \mathcal{U}$, the following statements hold with probability at least $1 - \delta$:

(I) $\left| L^{\text{model-free}}_{AL}(w, v, u) - \hat L^{\text{model-free}}_{AL}(w, v, u) \right| \le \epsilon^{\text{model-free}}_{stat}$;

(II) $L^{\text{model-free}}_{AL}(w^\star, v^\star, u^\star) - L^{\text{model-free}}_{AL}(\hat w, v^\star, u^\star_{\hat w}) \le 2\epsilon^{\text{model-free}}_{stat}$.

The final key lemma demonstrates that in model-based and model-free ALMIS, the ALM terms enforce lower bounds on the ratio of the estimated occupancy of the learned weights $d_{\hat w}(s)$ and the actual occupancy of the learned policy $d^{\pi_{\hat w}}(s)$ in most states. The proof of this lemma is given in Appendix E.7.

Lemma 11 (Occupancy validity by the ALM, MDP) For $\hat w$ computed by the model-based ALMIS Algorithm 3, define the state space subset $S_s := \left\{ s \;:\; d_{\hat w}(s) \le \frac{1}{2} d^{\pi_{\hat w}}(s) \right\}$. For any fixed $\delta > 0$, the following statements hold with probability at least $1 - \delta$:

(I) $E_{s,a\sim\mu}\left[-A^\star(s, a)\hat w(s, a)\right] \lesssim \epsilon^{\text{model-based}}_{stat}$;

(II) $\sum_{s \in S_s} d^{\pi_{\hat w}}(s) \lesssim (1 - \gamma)^{-2}\epsilon^{\text{model-based}}_{stat}$.
Similarly, for $\hat w$ computed by the model-free ALMIS Algorithm 4, define the state space subset $S_s := \left\{ s \;:\; d_{\hat w}(s) \le \frac{1}{2} d^{\pi_{\hat w}}(s) \right\}$. For any fixed $\delta > 0$, the following statements hold with probability at least $1 - \delta$:

(I) $E_{s,a\sim\mu}\left[-A^\star(s, a)\hat w(s, a)\right] \lesssim \epsilon^{\text{model-free}}_{stat}$;

(II) $\sum_{s \in S_s} d^{\pi_{\hat w}}(s) \lesssim (1 - \gamma)^{-2}\epsilon^{\text{model-free}}_{stat}$.

Given the above lemmas, we proceed to prove the suboptimality bounds in terms of the statistical errors defined in (55) and (56). In the rest of this section, we drop the superscripts model-based and model-free from the statistical errors to avoid cluttered notation. In view of the performance difference lemma (Kakade & Langford, 2002, Lemma 6.1), one has

$$J(\pi^\star) - J(\pi) = E_{s\sim d^{\pi}}\left[\sum_a A^\star(s, a)\left(\pi^\star(a|s) - \pi(a|s)\right)\right] = E_{s\sim d^{\pi}}\left[\sum_a -A^\star(s, a)\pi(a|s)\right],$$

where $d^{\pi} = d^{\pi_{\hat w}}$. Here, we used the fact that the expectation of the optimal advantage under the optimal policy is zero: $\sum_a A^\star(s, a)\pi^\star(a|s) = 0$. Lemma 11 links an expectation of $-A^\star(s, a)$ to the statistical error. With this lemma at hand and using the definition $S_s = \{s \mid d_{\hat w}(s) \le d^{\pi}(s)/2\}$, we continue to decompose and bound the suboptimality:

$$E_{s\sim d^{\pi}}\left[\sum_a -A^\star(s, a)\pi(a|s)\right] = \sum_{s \in S_s} d^{\pi}(s)\sum_a -A^\star(s, a)\pi(a|s) + \sum_{s \notin S_s} d^{\pi}(s)\sum_a -A^\star(s, a)\pi(a|s)$$

$$\lesssim \frac{1}{(1 - \gamma)^3}\epsilon_{stat} + \sum_{s \notin S_s,\, d_{\hat w}(s) \ne 0} \frac{d^{\pi}(s)}{d_{\hat w}(s)}\sum_a -A^\star(s, a)\hat w(s, a)\mu(s, a) + \sum_{s \notin S_s,\, d_{\hat w}(s) = 0} d^{\pi}(s)\sum_a -\frac{1}{|\mathcal{A}|}A^\star(s, a) \tag{57}$$

$$\le \frac{1}{(1 - \gamma)^3}\epsilon_{stat} + 2\sum_{s \notin S_s}\sum_a -A^\star(s, a)\hat w(s, a)\mu(s, a). \tag{58}$$

In (57), we used part (II) of Lemma 11 and the fact that $-A^\star(s, a) \le 1/(1 - \gamma)$, and in (58) we used the definition of $S_s$ to bound the ratio $d^{\pi}(s)/d_{\hat w}(s)$ by 2 and the fact that $d_{\hat w}(s) = 0$ implies $d^{\pi}(s) = 0$ for $s \notin S_s$. We then apply part (I) of Lemma 11 to bound the second term by $E_{s,a\sim\mu}[-A^\star(s, a)\hat w(s, a)]$ and thus the overall suboptimality:

$$J(\pi^\star) - J(\pi) \lesssim \frac{1}{(1 - \gamma)^3}\epsilon_{stat} + E_{s,a\sim\mu}\left[-A^\star(s, a)\hat w(s, a)\right] \lesssim \frac{1}{(1 - \gamma)^3}\epsilon_{stat}.$$

E.4 PROOF OF LEMMA 8

Derivation of $x^\star_w$. Recall from Appendix E.2 that for $f(x) = (x - 1)^2$, the Fenchel conjugate is $f_*(x) = \left(\frac{x + 2}{2}\right)^2 - 1$.
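Therefore, for any $(s, a)$,

$$x^\star_w(s, a) = \arg\max_x \left\{ d_w(s)x - d^{\pi_w}(s)\left[\left(\frac{x + 2}{2}\right)^2 - 1\right] \right\} = 2\frac{d_w(s)}{d^{\pi_w}(s)} - 2 \;\Rightarrow\; \hat x_w(s, a) = \mathrm{clip}\left(2\frac{d_w(s)}{d^{\pi_w}(s)} - 2,\; -B_x,\; B_x\right).$$

Bound on $u^\star_w$. Recall that $u^\star_w$ is defined as the fixed point of the following Bellman-like equation:

$$u(s, a) = f_*(\hat x_w(s, a)) + \gamma(P^{\pi_w}u)(s, a).$$

The above equation has a solution since $f_*(\hat x_w(s, a))$ is bounded:

$$\left(\frac{2 - B_x}{2}\right)^2 - 1 \le f_*(\hat x_w(s, a)) \le \left(\frac{B_x + 2}{2}\right)^2 - 1.$$

One can view $u^\star_w$ as the Q-function of policy $\pi_w$ with the reward function $f_*(\hat x_w(s, a))$, which leads to

$$|u^\star_w(s, a)| \le \frac{1}{1 - \gamma}\max\left\{ 1 - \left(\frac{2 - B_x}{2}\right)^2,\; \left(\frac{B_x + 2}{2}\right)^2 - 1 \right\} = \frac{1}{1 - \gamma}\left(B_x^2/4 + B_x\right).$$

Bound on $\zeta^\star_{w,u}$. To see the bound on $\zeta^\star_{w,u}$, recall that by definition,

$$\zeta^\star_{w,u} = \arg\max_{\zeta < 0} E_{(s,a,s')\sim\mu,\, a'\sim\pi_w(\cdot|s')}\left[ w(s, a)\left( (u(s, a) - \gamma u(s', a') + 1)\zeta(s, a) + 1/\zeta(s, a) \right) \right].$$

It is easy to show that

$$|\zeta^\star_{w,u}(s, a)| = \left(u(s, a) - \gamma(P^{\pi_w}u)(s, a) + 1\right)^{-1/2} = \left(f_*(\hat x_w(s, a)) + 1\right)^{-1/2},$$

where the second equality holds at $u = u^\star_w$ by the fixed-point equation above. Since $\hat x_w(s, a) \in [-B_x, B_x]$, we have $|\zeta^\star_{w,u}(s, a)| \in \left[\frac{2}{2 + B_x}, \frac{2}{2 - B_x}\right]$.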
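The closed-form expressions in this proof can be sanity-checked numerically. The sketch below is an illustration only: it uses a plain grid search (the helper `grid_argmax` is hypothetical), and the values $B_x = 1$ and occupancy ratio $d_w(s)/d^{\pi_w}(s) = 0.8$ are assumptions chosen for the example. It checks the Fenchel conjugate of $f(x) = (x-1)^2$, the maximizer $x^\star_w = 2 d_w/d^{\pi_w} - 2$, and the magnitude $|\zeta^\star| = (f_*(\hat x_w) + 1)^{-1/2} = 2/(\hat x_w + 2)$.

```python
def f(y):
    return (y - 1.0) ** 2


def f_star(x):
    """Closed-form Fenchel conjugate of f, as stated in the text."""
    return ((x + 2.0) / 2.0) ** 2 - 1.0


def clip(x, lo, hi):
    return max(lo, min(x, hi))


def grid_argmax(fn, lo, hi, n=20001):
    """Maximize fn over an evenly spaced grid on [lo, hi]."""
    best_x, best_val = lo, fn(lo)
    for i in range(1, n):
        x = lo + (hi - lo) * i / (n - 1)
        val = fn(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


B_X = 1.0    # assumed clipping level
ratio = 0.8  # assumed value of d_w(s) / d^{pi_w}(s)

# Conjugate at x = 0.6: max_y (x * y - f(y)).
_, conj = grid_argmax(lambda y: 0.6 * y - f(y), -5.0, 5.0)

# Maximizer of the variational form, divided through by d^{pi_w}(s).
x_star, _ = grid_argmax(lambda x: ratio * x - f_star(x), -2.0, 2.0)
x_hat = clip(2.0 * ratio - 2.0, -B_X, B_X)

# Maximizer over zeta < 0 of (f_*(x_hat) + 1) * zeta + 1 / zeta.
c = f_star(x_hat) + 1.0
zeta_star, _ = grid_argmax(lambda z: c * z + 1.0 / z, -5.0, -0.05)

print(conj, f_star(0.6), x_star, zeta_star)
```

The grid search is deliberately crude; it only needs to agree with the closed forms up to the grid resolution.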
E.5 PROOF OF LEMMA 9 E.5.1 PROOF OF PART (I) We decompose the difference between the population and empirical objective into three terms L model-based AL -Lmodel-based AL = T 1 + T 2 + T 3 defined as follows T 1 := (1 -γ)E ρ v(s) + a u(s, a)π w (a|s) -(1 -γ) 1 N 0 N0 i=1 v(s i ) + a u(s i , a)π w (a|s i ) T 2 := E µ w(s, a)(r(s, a) + γ s ′ P (s ′ |s, a)v(s ′ ) -v(s)) - 1 N N i=1 w(s i , a i ) [r i + γv(s ′ i ) -v(s i )] T 3 := E µ w(s, a) f -1 * (u(s, a) -γP πw u(s, a)) - 1 N N i=1 w(s i , a i ) f -1 * u(s i , a i ) -γ P πw u(s i , a i ) We subsequently show that the absolute values of the above error terms satisfy the following high probability upper bounds: |T 1 | ≲ (B v + B u ) log|V||U|/δ N 0 , |T 2 | ≲ (1 + B v )B w log|V||W|/δ N , |T 3 | ≲ B w B u log|P||U||W|/δ N Taking N 0 = N and noting that B w ≥ 1 due to realizability of w ⋆ yield that  L model-based AL -Lmodel-based AL ≲ (B v + B u ) log|V||U |/δ N 0 + (1 + B v )B w B u log(|P||U||W||V|/δ) N ≲ ϵ model- ) -v(s))| ≤ B w (1 + (γ + 1)B v ) ≤ B w (1 + γ)(1 + B v ). As before, due to boundedness and independence of variables w(s i , a i )[r i + γv(s ′ i ) -v(s i )], Hoeffding's inequality can be applied, giving the bound (61b) on |T 2 |. Proof of the bound (61c) on |T 3 |. We decompose T 3 = T 3,1 + T 3,2 , where T 3,1 and T 3,2 are defined as T 3,1 := E µ w(s, a) f -1 * (u(s, a) -γ(P πw u)(s, a)) -E µ w(s, a) f -1 * u(s, a) -γ( Pπw u)(s, a) T 3,2 := E µ w(s, a) f -1 * u(s, a) -γ( Pπw u)(s, a) - 1 N N i=1 w(s i , a i ) f -1 * u(s i , a i ) -γ( Pπw u)(s i , a i ) Recall that f -1 * (x) = 2 √ x + 1 -2 from Appendix E.2. The absolute value of T 3,2 can be immediately bounded using Hoeffding's inequality: |T 3,2 |≲ B w B u log|W||U|δ N . 
To bound |T 3,1 |, we first use the inequality given in Lemma 13, setting b i , x i , y i for each (s, a) according to b i = 1 + u(s, a) i = 0 γ a ′ π w (a ′ |s ′ )u(s ′ |a ′ ) 1 ≤ i ≤ |S| x i = 1 i = 0 P (s ′ |s, a) 1 ≤ i ≤ |S| , y i = 1 i = 0 P (s ′ |s, a) 1 ≤ i ≤ |S| Thus by Lemma 13, we obtain the following bound on T 2 3,1 T 2 3,1 = E µ w(s, a) f -1 * (u(s, a) -γP πw u(s, a)) -E µ w(s, a) f -1 * u(s, a) -γ Pπw u(s, a) 2 ≲ B w E µ   1 + u(s, a) -γ s ′ P (s ′ |s, a) a ′ π w (a ′ |s ′ )u(s ′ , a ′ )   -E µ   1 + u(s, a) -γ s ′ P (s ′ |s, a) a ′ π w (a ′ |s ′ )u(s ′ , a ′ )   2 ≤ B 2 w B u E µ s ′ P (s ′ |s, a) -P (s ′ |s, a) 2 . ( ) Note that the terms under square root are always nonnegative because for any transition P 1 + u(s, a) -γ s ′ P (s ′ |s, a) a ′ π w (a ′ |s ′ )u(s ′ , a ′ ) ≥ 1 -B u -γB u ≥ 1 -2B u ≥ 0. Then, we use the concentration result on maximum likelihood model estimation stated in Theorem 6 and a union bound on w ∈ W and v ∈ V to conclude that Decompose the objective difference according to |T 3,1 |≲ B w B u log|P||U||W|/δ N . L model-based AL (w ⋆ , v ⋆ , u ⋆ ) -L model-based AL ( ŵ, v ⋆ , u ⋆ ŵ) = L model-based AL (w ⋆ , v ⋆ , u ⋆ ) -L model-based AL (w ⋆ , vw ⋆ , ûw ⋆ ) := T 1 + L model-based AL (w ⋆ , vw ⋆ , ûw ⋆ ) -Lmodel-based AL (w ⋆ , vw ⋆ , ûw ⋆ ) := T 2 + Lmodel-based AL (w ⋆ , vw ⋆ , ûw ⋆ ) -Lmodel-based AL ( ŵ, v ŵ, û ŵ) := T 3 + Lmodel-based AL ( ŵ, v ŵ, û ŵ) -Lmodel-based AL ( ŵ, v ⋆ , u ⋆ ŵ) := T 4 + Lmodel-based AL ( ŵ, v ⋆ , u ⋆ ŵ) -L model-based AL ( ŵ, v ⋆ , u ⋆ ŵ) := T 5 We bound each term: • w(s i , a i ) u(s i , a i ) -γ a ′ ∈A u(s ′ i , a ′ )π w (a ′ |s ′ i ) ζ(s i , a i ) -g ⋆ (ζ(s i , a i )) . 
The absolute values of the error terms above satisfy the following upper bounds with high probability We then show that each term inside the expectation is nonnegative:  |T 1 | ≲ (B v + B u ) log(|V||U|/δ) N 0 , |T 2 | ≲ (1 + B v )B w log(|V||W|/δ) N , d πw (s) B x - B x 2 + 1 2 -1 ≥ B x 2 + 1 B x - B 2 x 4 -B x = B 2 x 4 ≥ 0. 3. Similarly, when dw(s) d πw (s) < 1 -B x /2, substitute xw (s, a) = -B x to arrive at - d w (s) d πw (s) B x - 1 - B x 2 2 -1 ≥ B x 2 -1 B x - B 2 x 4 + B x = B 2 x 4 ≥ 0. (72) E.7.2 PROOF OF PART (II) We derive the second part by using the bound (70b) restricted on the set S s . When s ∈ S s , we have We demonstrated that Lagrange multipliers are not sufficient to enforce occupancy validity and we need to use additional penalty terms. Furthermore, we discussed why ensuring ratio-based occupancy validity is compatible with the single-policy concentrability definition, resulting in learning a policy whose actual occupancy is within the data distribution. However, one might wonder whether a more standard application of the ALM term, which involves adding a squared penalty on Bellman flow error, leads to a similar policy validity guarantee. This idea is appealing because it avoids variational lower bound and additional variables. However, here we provide an intuitive argument that a penalty term on Bellman flow error does not appear to be sufficient to ensure a ratio-based occupancy validity guarantee. We use ϵ(s) to denote Bellman flow error defined as ϵ(s) = (1 -γ)ρ(s) + γ s ′ ,a ′ P (a|s ′ , a ′ )µ(s ′ , a ′ )w(s ′ , a ′ )a w(s, a)µ(s, a). We show that even such a strong state-wise guarantee on Bellman error cannot generally lead to Therefore, to ensure a constant bound on d π ŵ (s) d ŵ (s) , we require Bellman flow error to be pointwise smaller than the initial distribution. For state s with ρ(s) = 0, this means that the Bellman flow error is required to be nonpositive: ϵ(s) ≤ 0. 
However, even state-wise minimization of squared penalty terms such as ϵ 2 (s) can only ensure |ϵ(s)| to be small.

F ROBUSTNESS TO MODEL MISSPECIFICATION AND OPTIMIZATION ERROR

In this section, we study the sample complexity of our algorithm in the presence of model misspecification and optimization error, similar to Zhan et al. (2022). Since in practice the function classes $\mathcal{W}, \mathcal{V}$ may not contain $w^\star, v^\star$, we follow Zhan et al. (2022) and measure the approximation errors of $\mathcal{W}$ and $\mathcal{V}$ by

$$\epsilon_{r,v} = \min_{v \in \mathcal{V}} \|v - v^\star\|_{1,\rho} + \|v - v^\star\|_{1,\mu} + \|v - v^\star\|_{1,\mu'}, \qquad \epsilon_{r,w,w^\star} = \min_{w \in \mathcal{W}} \|w - w^\star\|_{1,\mu},$$

where $w^\star = d^\star/\mu$ and $d^\star$ is the (discounted) occupancy frequency of any optimal policy $\pi^\star$, $\|\cdot\|_{1,\rho}$ is the weighted $\ell_1$ norm w.r.t. $\rho$, and $\mu'(s) = \sum_{s',a'} P(s|s', a')\mu(s', a')$. The model misspecification error is measured in the $\ell_1$ norm, which is weaker than the $\ell_\infty$ norm.
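For a finite function class on a tabular problem, the approximation error $\epsilon_{r,v}$ can be computed by direct enumeration. The sketch below is illustrative and rests on two assumptions not stated in the text: the minimum is taken over the sum of the three norms, and $\|\cdot\|_{1,\mu}$ for a state function uses the state marginal of $\mu$. The two-state instance and all function names are hypothetical.

```python
def weighted_l1(g, h, weights):
    """Weighted l1 distance sum_s weights[s] * |g[s] - h[s]|."""
    return sum(wt * abs(a - b) for wt, a, b in zip(weights, g, h))


def next_state_marginal(P, mu, n_states):
    """mu'(s) = sum_{s', a'} P(s | s', a') * mu(s', a')."""
    return [sum(P[sa][s] * mu[sa] for sa in mu) for s in range(n_states)]


def state_marginal(mu, n_states):
    """State marginal of a distribution over (state, action) pairs."""
    return [sum(p for (s, _), p in mu.items() if s == k) for k in range(n_states)]


def eps_r_v(v_class, v_star, rho, mu_s, mu_next):
    """Minimum over the class of the summed weighted l1 errors."""
    return min(weighted_l1(v, v_star, rho)
               + weighted_l1(v, v_star, mu_s)
               + weighted_l1(v, v_star, mu_next)
               for v in v_class)


# Hypothetical two-state, two-action instance.
mu = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
     (1, 0): [0.0, 1.0], (1, 1): [0.2, 0.8]}
rho = [0.5, 0.5]
v_star = [1.0, 2.0]
v_class = [[1.0, 1.5], [0.0, 2.0]]  # misspecified: does not contain v_star

mu_s = state_marginal(mu, 2)
mu_next = next_state_marginal(P, mu, 2)
eps = eps_r_v(v_class, v_star, rho, mu_s, mu_next)
print(mu_next, eps)
```

When the class contains $v^\star$, the same computation returns zero, which corresponds to the realizable setting analyzed in the earlier sections.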



Notice that the validity constraints in MAB and CB are special cases of this constraint. When $\mu(a|s)$ is unknown, behavioral cloning can be used (Ross & Bagnell, 2014; Zhan et al., 2022).



for all $x \ge x_0$ and use $\tilde O(\cdot)$ to denote the big-O notation ignoring logarithmic factors. Define $\mathrm{clip}(x, a, b) \triangleq \max\{a, \min\{x, b\}\}$ for $x, a, b \in \mathbb{R}$.

For any $w \in \mathcal{W}$, define $\hat v_w = \arg\min_{v \in \mathcal{V}} \hat L^{MAB}_\alpha(w, v)$.

We have $E_{a\sim\mu}[w^\star(a)] = 1$ due to realizability, and $E_{a\sim\mu}[\hat w(a)] = 1$ by assumption. Thus the second term in (23) is zero. Moreover, note that $\pi(a) = \hat w(a)\mu(a)/E_{a\sim\mu}[\hat w(a)] = \hat w(a)\mu(a)$.

v) := Es,a∼µ [w(s, a)r(s, a)] -Es,a∼µ[v(s)(w(s, a) -1)] -αEs,a∼µ [f (w(s, a))] ,

$$E_\mu\left[w(s, a)\, g\left(u(s, a) - \gamma P^{\pi_w}u(s, a)\right)\right] = E_\mu\left[w(s, a)\max_{\zeta < 0}\left(u(s, a) - \gamma(P^{\pi_w}u)(s, a)\right)\zeta - g_*(\zeta)\right]$$

$$= E_\mu\left[w(s, a)\max_{\zeta < 0}\left(u(s, a) - \gamma(P^{\pi_w}u)(s, a)\right)\zeta + \zeta + 1/\zeta + 2\right].$$

The interchangeability principle (Rockafellar & Wets, 2009; Dai et al., 2017) allows us to convert the inner maximization over the scalar $\zeta$ to an overall maximization over functions $\zeta: \mathcal{S} \times \mathcal{A} \to \mathbb{R}_-$. Replacing this term in the objective (13) results in the following objective:

$$L^{\text{model-free}}_{AL}(w, v, u, \zeta) = (1 - \gamma)E_{s\sim\rho}\left[v(s) + \sum_a u(s, a)\pi_w(a|s)\right] + E_{(s,a,s')\sim\mu,\, a'\sim\pi_w(\cdot|s')}\left[w(s, a)\left(e_v(s, a) + (u(s, a) - \gamma u(s', a'))\zeta(s, a) - g_*(\zeta(s, a))\right)\right],$$

deriving an expression for $x^\star_w$ and characterizing bounds on $u^\star_w$ and $\zeta^\star_{w,v}$ in the following lemma. The proof is presented in Appendix E.4.

Lemma 8 For any $w \in \mathcal{W}$ and $v \in \mathcal{V}$, one has $x^\star_w(s, a) = 2d_w(s)/d^{\pi_w}(s) - 2$, $|u^\star_w(s, a)| \le \frac{1}{1 - \gamma}\left(B_x^2/4 + B_x\right)$, and |ζ ⋆ w,v (s,

.2 PROOF OF PART (II) To prove the second part, let vw and ûw denote the solutions to the model-based empirical objective vw ,

T 1 ≤ 0 because v ⋆ , u ⋆ = arg min v arg min u L model-based AL (w ⋆ , v, u); • T 2 ≤ ϵ model-based stat by Lemma 9; • T 3 ≤ 0 because ŵ = arg max w∈W Lmodel-based AL (w, vw , ûw ); • T 4 ≤ 0 because vw , ûw = arg min v∈V arg min u∈U Lmodel-based AL (w, v, u); • T 5 ≤ ϵ model-based stat by Lemma 9. E.6 PROOF OF LEMMA 10 E.6.1 PROOF OF PART (I) We decompose the difference L model-free AL -Lmodel-free AL = T 1 + T 2 + T 3 into three error termsT 1 := (1 -γ)E ρ v(s) + a u(s, a)π w (a|s) -(1 -γ) i , a)π w (a|s i ) T 2 := E µ w(s, a)(r(s, a) + γ s ′ P (s ′ |s, a)v(s ′ ) -v(s))i , a i ) [r i + γv(s ′ i ) -v(s i )] T 3 := E (s,a,s ′ )∼µ,a ′ ∼πw(•|s ′ ) [w(s, a) ((u(s, a) -γu(s ′ , a ′ ))ζ(s, a) -g ⋆ (ζ(s, a)))]

≲ (1 + B ζ (B u + 1))B w log|U||W||Z|/δ N .(65c)The bounds on the first two error terms |T 1 | and |T 2 | are already shown in Appendix E.5.1. To bound |T 3 |, recall that g ⋆ (x) = -x -2 -1 x , ∀x < 0. Also, |ζ(s, a)|∈ (B ζ,L , B ζ,U ) for any ζ ∈ Z and any (s, a), and B ζ ≜ max{B ζ,U , B -1 ζ,L }. Therefore, the individual error terms in |T 3 | satisfy the following bound|w(s, a) ((u(s, a) -γu(s ′ , a ′ ))ζ(s, a) -g ⋆ (ζ(s, a)))| ≤B w ((1 + γ)B u B ζ,U + B ζ,U + B -1 ζ,L + 2).Thus, by Hoeffding's inequality and a union bound on W, U, and Z, we obtain the upper bound (65c) on |T 3 |. Summing up the bounds given in (65a), (65b), and (65c) and noting that B w ≥ 1 due to realizability of w ⋆ , we obtainL model-free AL (w, v, u, ζ) -Lmodel-free AL (w, v, u, ζ) ≲ (B v + B u ) log|V||U|/δ N 0 + (1 + B v + B ζ (B u + 1))B w log|U||W||V||Z|/δ N ≲ ϵ model-freestat . E.6.2 PROOF OF PART (II) Define the following solutions to the empirical model-free objective vw , ûw , ζw = arg min v∈V arg min u∈U arg max ζ∈Z Lmodel-free AL (w, v, u, ζ), ∀w ∈ W ζ(w, u) = arg max ζ∈Z Lmodel-free AL (w, v, u, ζ) ∀w ∈ W, u ∈ UGiven the above expression of the objective at (w ⋆ , v ⋆ , u ⋆ ), we write the following objective differenceL model-based AL (w ⋆ , v ⋆ , u ⋆ ) -L model-based AL ( ŵ, v ⋆ , u ⋆ ŵ) =(1 -γ)E s∼ρ [V ⋆ (s)] -(1 -γ)E s∼ρ [v ⋆ (s)] -E s,a∼µ [ ŵ(s, a)e v ⋆ (s, a)] + (1 -γ)E s∼ρ,a∼π ŵ [u ⋆ ŵ(s, a)] + E µ ŵ(s, a)f -1 * (u ⋆ ŵ(s, a) -γ(P π ŵ u ⋆ ŵ)(s, a)) (68) = -E s,a∼µ [ ŵ(s, a)A ⋆ (s, a)] -E d π ŵ [f * (x ŵ(s, a))] + E d ŵ [x ŵ(s, a)]The last line uses e v ⋆ (s, a) = A ⋆ (s, a) as well as the definition of u ⋆ ŵ as the fixed point solution to u ⋆ ŵ(s, a) := f * (x ŵ(s, a)) + γ(P π ŵ u ⋆ ŵ)(s, a), which allows us to write (68) in the original f -divergence variational form (11) with x ŵ as variable. Lemma 9 asserts thatL model-based AL (w ⋆ , v ⋆ , u ⋆ ) -L model-based AL ( ŵ, v ⋆ , u ⋆ ŵ) ≲ ϵ model-based stat . 
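Therefore,

$$-E_{s,a\sim\mu}\left[\hat w(s, a)A^\star(s, a)\right] - E_{d^{\pi_{\hat w}}}\left[f_*(\hat x_{\hat w}(s, a))\right] + E_{d_{\hat w}}\left[\hat x_{\hat w}(s, a)\right] \lesssim \epsilon^{\text{model-based}}_{stat}. \tag{69}$$

We next argue that both terms in the inequality above are nonnegative and conclude that

$$-E_{s,a\sim\mu}\left[\hat w(s, a)A^\star(s, a)\right] \lesssim \epsilon^{\text{model-based}}_{stat}, \tag{70a}$$

$$-E_{d^{\pi_{\hat w}}}\left[f_*(\hat x_{\hat w}(s, a))\right] + E_{d_{\hat w}}\left[\hat x_{\hat w}(s, a)\right] \lesssim \epsilon^{\text{model-based}}_{stat}. \tag{70b}$$

The first term is nonnegative because the optimal advantage function satisfies $A^\star(s, a) \le 0$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. We write the second term as

$$-E_{d^{\pi_{\hat w}}}\left[f_*(\hat x_{\hat w}(s, a))\right] + E_{d_{\hat w}}\left[\hat x_{\hat w}(s, a)\right] = E_{d^{\pi_{\hat w}}}\left[\frac{d_{\hat w}(s)}{d^{\pi_{\hat w}}(s)}\hat x_{\hat w}(s, a) - f_*(\hat x_{\hat w}(s, a))\right].$$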

d w (s) d πw (s) xw (s, a) -f * (x w (s, a)) ≥ 0 ∀s ∈ S, w ∈ W. (71) Proof of bound (71). we separate the argument into three cases and use the expression of xw given in Lemma 8. 1. When 1 -B x /2 ≤ dw(s) d πw (s) ≤ B x /2 + 1, we have xw (s, a) = 2 dw(s) d πw (s) -2 and therefore d w (s) d πw (s) xw (s, a) -f * (x w (s, a)) = d w (s) d πw (s) dw(s) d πw (s) > B x /2 + 1, substitute xw (s, a) = B x to arrive at d w (s)

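$\frac{d_{\hat w}(s)}{d^{\pi_{\hat w}}(s)} \le \frac{1}{2}$ and thus the variational form falls into case 3 in the proof of bound (71). Therefore, for $s \in S_s$, we have $\hat x_{\hat w}(s, a) = -B_x$ and

$$\frac{d_{\hat w}(s)}{d^{\pi_{\hat w}}(s)}\hat x_{\hat w}(s, a) - f_*(\hat x_{\hat w}(s, a)) \gtrsim (1 - \gamma)^2 \quad \forall s \in S_s. \tag{73}$$

We use the bound in (70b) as well as (73) to conclude that

$$\epsilon^{\text{model-based}}_{stat} \gtrsim E_{d_{\hat w}}\left[\hat x_{\hat w}(s, a)\right] - E_{d^{\pi_{\hat w}}}\left[f_*(\hat x_{\hat w}(s, a))\right] = \sum_s d^{\pi_{\hat w}}(s)\left[\frac{d_{\hat w}(s)}{d^{\pi_{\hat w}}(s)}\hat x_{\hat w}(s, a) - f_*(\hat x_{\hat w}(s, a))\right] \gtrsim \sum_{s \in S_s}(1 - \gamma)^2 d^{\pi_{\hat w}}(s),$$

which leads to the second advertised claim $\sum_{s \in S_s} d^{\pi_{\hat w}}(s) \lesssim (1 - \gamma)^{-2}\epsilon_{stat}$.

E.8 ALM BASED ON BELLMAN FLOW ERROR CONSTRAINT IS INSUFFICIENT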

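$d^{\pi_{\hat w}}(s)/d_{\hat w}(s)$ being bounded by a constant. We argue this by contradiction. Assume that for $0 \le c < 1$,

$$\frac{d^{\pi_{\hat w}}(s)}{d_{\hat w}(s)} \le \frac{1}{1 - c} \iff d^{\pi_{\hat w}}(s) - d_{\hat w}(s) \le c\, d^{\pi_{\hat w}}(s). \tag{74}$$

Since $d^{\pi_{\hat w}}$ satisfies the Bellman flow equations, we can show $d^{\pi_{\hat w}} - d_{\hat w} = (I - \gamma P^{\pi_{\hat w}})^{-1}\epsilon$. Moreover, we have $d^{\pi_{\hat w}} = (I - \gamma P^{\pi_{\hat w}})^{-1}\rho$. Substituting these equations into (74), we conclude that for $0 \le c < 1$,

$$(I - \gamma P^{\pi_{\hat w}})^{-1}\epsilon \le c(I - \gamma P^{\pi_{\hat w}})^{-1}\rho \iff \epsilon \le c\rho.$$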

and r i ∼ R(s i , a i ). Let f be M f -strongly convex, differentiable, and nonnegative with bounded values |f (x)|≤ B f and derivative |f ′

Summary of suboptimality bounds on the MIS objective with different added terms.

. One exception is the work of Zanette et al. (2021) that uses value-function perturbation with actor-critic in the linear function approximation setting. Other examples include the recent theoretical works on MIS (Zhan et al., 2022; Chen & Jiang, 2022) and adversarially trained actor-critic (Cheng et al., 2022). Most related to our work are methods that focus on provable conservative offline RL under general function approximation and partial coverage, summarized under the pessimistic algorithms segment of Table B.2. Uehara & Sun (2021) propose a pessimistic model-based algorithm that, under a generalization of single-policy concentrability to bounded TV distance ratio, enjoys a $1/\sqrt{N}$ rate but is computationally intractable. The work of Xie et al. (

Comparison of provable offline RL algorithms with general function approximation.

ACKNOWLEDGMENTS

The authors are grateful to Amy Zhang and Yuandong Tian. This work occurred under Meta AI-BAIR Commons at the University of California, Berkeley, and is partially supported by NSF Grants IIS-1901252 and CCF-1909499. PR is supported by the Open Philanthropy Foundation. Part of the work was done when HZ was a visiting researcher at Meta.


Decompose the objective difference according toWe bound each term:• T 2 ≤ ϵ model-free by part (I);

E.7 PROOF OF LEMMA 11

We provide proof only for the model-based algorithm and let ŵ = ŵmodel-based for notation convenience. The proof for a model-free algorithm follows analogously, noting the fact thatand we can replace Lemma 9 with Lemma 10 to prove the model-free version.

E.7.1 PROOF OF PART (I)

Consider the expression of the model-based objective L model-based AL (w ⋆ , v ⋆ , u ⋆ ) at the optimal solution where u ⋆ := u ⋆ w ⋆ :The first equation comes from the fact that u ⋆ is the optimal solution to the variational lower bound, making it equal to the f -divergence. To see this, recall from Lemma 8 that x ⋆ w (s, a) = 2d w (s)/d πw (s) -2 and xw (s, a) = clip(x ⋆ w (s, a), -B x , B x ). Since x ⋆ w ⋆ (s, a) = 0, we have xw ⋆ (s, a) = x ⋆ w ⋆ (s, a) and thus u ⋆ recovers the f -divergence. In Equation (66), we wrote v ⋆ (s) = V ⋆ (s) since v ⋆ (s) is the optimal solution to the primal-dual program without the ALM term and is equal to the optimal value function (Zhan et al., 2022) . We also used the fact that e v ⋆ (s, a) = r(s, a)+γ s ′ P (s ′ |s, a)v ⋆ (s ′ )-v ⋆ (s) = A ⋆ (s, a) is the optimal advantage function, and that d w ⋆ (s) = d π w ⋆ (s) by definition and realizability of w ⋆ . Moreover, the second term in ( 66) is zero since it captures the optimal advantage of optimal policy. Therefore, we conclude thatFurthermore, we also consider the optimization error of practical optimization algorithms since in real-world scenarios it is unlikely that an algorithm can recover the exact optimal solution. Instead, a typical optimization algorithm is able to find an approximate solution that is close enough to the true optimal solution. Formally, we assume that the solution ( ŵ, v) that the optimizer obtained satisfieswhere the objective L can be substituted by any objective with ALM term in different settings (e.g., L = LCB AL in contextual bandits). The assumption above is also similar to Zhan et al. 
(2022) , and it assumes that L( ŵ, v) ≈ max w∈W min v∈V L(w, v), which shows that ( ŵ, v) is an approximate max-min solution of L.With definition ( 75) and ( 76), the main result of this section is stated as follows:Theorem 5 (Robust version of Theorem 3) Assume concentrability of an optimal policy π ⋆ (Definition 1) and let w ⋆ (s, a) = d π ⋆ (s, a)/µ(s, a), v ⋆ = J(π ⋆ ). Assume that |v(s)|≤ B v for v ∈ V and 0 ≤ w(s, a) ≤ B w for w ∈ W. Moreover, assume (75) and (76) hold. Then for any fixed δ > 0, policy π returned by Algorithm 2 (where ( ŵ, v) satisfies (76) with L = LCB AL instead of the exact max-min solution as in (8)) achieveswith probability at least 1 -δ, whereNote that we only present the robustness result for contextual bandit settings for conciseness. Similar results also hold for MAB, model-based MDP, and model-free MDP settings. . Also, letBy the same argument as in Appendix D.3, (w, where v(w) = arg min v∈V L(w, v). Each term is bounded as follows:• T 1 = 0 because w ⋆ satisfies the optimization constraints.• T 2 ≤ (B w + B v + 3)ϵ r,w,w ⋆ by Lemma 12.• T 3 ≤ ϵ stat due to part (I) of Lemma 6.• T 4 ≤ ϵ o,w because max w∈W min v∈V L(w, v) -min v∈V L( ŵ, v) ≤ ϵ o,w and w ⋆ W ∈ W.• T 6 ≤ ϵ stat due to part (I) of Lemma 6.• T 7 ≤ (B w + 1)ϵ r,v by Lemma 12.Summing up the bounds on each term, we haveThe remaining steps are the same as Appendix D.3, and we can finally obtain that□ Remark 1 Note that the suboptimality caused by model misspecification and optimization error in our algorithm is of order O(ϵ opt + ϵ mis ). This is much better than the result of Zhan et al. (2022) where the suboptimality caused by model misspecification and optimization error is of order O( (ϵ opt + ϵ mis )/α).Finally, we show and prove the following lemma which is key to the proof of our main theorem (Theorem 5) in this section. This is also similar to Zhan et al. 
(2022) .Lemma 12 Under the same setting as in Theorem 5, for any v ∈ V and any w 1 , w 2 ∈ W, it holds thatAlso, for any v 1 , v 2 ∈ V and any w ∈ W, it holds thatProof. Recall thatand in contextual bandits we have ρ = µ. Therefore, by definition,Similarly, we haveTheorem 6 (Convergence of MLE for learning transitions (Van de Geer, 2000)) Given a realizable model class P = {P : (S, A) → ∆(S)} that contains the true model P ⋆ and a datasetFix the failure probability δ > 0. Then, with probability at least 1 -δ, we have the following concentration on the squared Hellinger distance between P and P ⋆ :Lemma 13 For any 0 ≤ b i ≤ B and x i , y i ≥ 0 for i ∈ {0, . . . , n}, the following holdsProof. We expend the left-hand side of ( 77), use Cauchy-Schwarz inequality, and then complete the square:□ Lemma 14 For any two arbitrary sets X , Y, let f (•, •) : X × Y → R be an arbitrary function.Let X 0 = {x ∈ X | inf y∈Y f (x, y) > -∞} and assume X 0 is non-empty. For any x ∈ X 0 , assume there exists yProof. Note that for any fixed x, f AL (x, y) is a constant shift of f (x, y), which implies that inf y∈Y f (x, y) > -∞ ⇐⇒ inf y∈Y f AL (x, y) > -∞. This also implies that for any x ∈ X 0 , arg min y∈Y f (x, y) = arg min y∈Y f AL (x, y).For any x ∈ X 0 , let y * (x) denote any one of y ∈ Y s.t. f (x, y * (x)) = min y∈Y f (x, y).

Now for any x

=⇒f AL (x * , y * (x * )) ≥ f AL (x, y * (x)), ∀x ∈ X 0 . For the other direction, given any x 0 ∈ arg max x∈X0 min y∈Y f AL (x, y), we have min y∈Y f AL (x 0 , y) ≥ min y∈Y f AL (x, y), ∀x ∈ X 0 .

=⇒ min

=⇒f AL (x 0 , y * (x 0 )) ≥ f AL (x, y * (x)), ∀x ∈ X 0 .(78) Fix any x * ∈ X * ⊆ X 0 , we have f (x * , y * (x * )) ≥ f (x 0 , y * (x 0 )) and -A(x * ) ≥ -A(x 0 ) by definition. Now assume x 0 / ∈ X * . Then either f (x * , y * (x * )) > f (x 0 , y * (x 0 )) if x 0 / ∈ X * p , or -A(x * ) > -A(x 0 ) if x 0 ∈ X * p \X * . Either one of the above two conditions implies that f (x * , y * (x * )) -A(x * ) > f (x 0 , y * (x 0 )) -A(x 0 ) =⇒ f AL (x * , y * (x * )) > f AL (x 0 , y * (x 0 )), which contradicts with (78). Therefore, x 0 ∈ X * . □

