NEARLY MINIMAX OPTIMAL OFFLINE REINFORCEMENT LEARNING WITH LINEAR FUNCTION APPROXIMATION: SINGLE-AGENT MDP AND MARKOV GAME

Abstract

Offline reinforcement learning (RL) aims to learn an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, minimax optimality has only been (nearly) established for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose a new pessimism-based algorithm for offline linear MDPs. At the core of our algorithm is an uncertainty decomposition via a reference function, which is new in the literature of offline RL under linear function approximation. Theoretical analysis demonstrates that our algorithm matches the performance lower bound up to logarithmic factors. We also extend our techniques to two-player zero-sum Markov games (MGs), and establish a new performance lower bound for MGs, which tightens the existing result and verifies the nearly minimax optimality of the proposed algorithm. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.

* The first two authors contribute equally.

1. INTRODUCTION

Reinforcement learning (RL) has achieved tremendous empirical success in both single-agent (Kober et al., 2013) and multi-agent scenarios (Silver et al., 2016; 2017). Two components play a critical role: function approximation and efficient simulators. For RL problems with a large (or even infinite) number of states, storing a table as in classical Q-learning is generally infeasible. In these cases, practical algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2015; 2017; Haarnoja et al., 2018) approximate the true value function or policy by a function class (e.g., neural networks). Meanwhile, an efficient simulator allows for learning a policy in an online trial-and-error fashion using millions or even billions of trajectories. However, due to the limited availability of data samples in many practical applications, e.g., healthcare (Wang et al., 2018) and autonomous driving (Pan et al., 2017), instead of collecting new trajectories, we may have to extrapolate knowledge only from past experiences, i.e., a pre-collected dataset. This type of RL problem is usually referred to as offline RL or batch RL (Lange et al., 2012; Levine et al., 2020).

An offline RL algorithm is usually measured by its sample complexity to achieve the desired statistical accuracy. A line of works (Xie et al., 2021b; Shi et al., 2022; Li et al., 2022) demonstrates that near-optimal sample complexity in tabular single-agent MDPs is attainable. However, these algorithms cannot solve problems with large or infinite state spaces where function approximation is involved. To the best of our knowledge, existing algorithms cannot attain the statistical limit even for linear function approximation, which is arguably the simplest function approximation setting. Specifically, for linear function approximation, Jin et al. (2021c) proposes the first efficient algorithm for offline linear MDPs, but their upper bound is suboptimal compared with the existing lower bounds in Jin et al. (2021c); Zanette et al. (2021). Recently, Yin et al. (2022) tries to improve the result by incorporating variance information in the algorithmic design of offline MDPs with linear function approximation. However, a careful examination reveals a technical gap, and some additional assumptions may be needed to fix it (cf. Section 3 and Section 5). Beyond single-agent MDPs, Zhong et al. (2022) studies Markov games (MGs) with linear function approximation and provides the only provably efficient algorithm, with a suboptimal result. Therefore, the following problem remains open:

Can we design computationally efficient offline RL algorithms for problems with linear function approximation that are nearly minimax optimal?

In this paper, we first answer this question affirmatively under linear MDPs (Jin et al., 2020) and then extend our results to two-player zero-sum Markov games (MGs) (Xie et al., 2020). Our contributions are summarized as follows:

• We identify an implicit and restrictive assumption required by existing approaches in the literature, which originates from omitting the complicated temporal dependency between different time steps. See Section 3 for a detailed explanation.

• We handle the temporal dependency by an uncertainty decomposition technique via a reference function, thus closing the gap to the information-theoretic lower bound without the restrictive independence assumption. The uncertainty decomposition serves to avoid a √d-amplification of the value function error and also the measurability issue arising from incorporating variance information to improve the H-dependence, where d and H are the feature dimension and planning horizon, respectively. To the best of our knowledge, this technique is new in the literature of offline learning under linear function approximation.

• We further generalize the developed techniques to two-player zero-sum linear MGs (Xie et al., 2020), thus demonstrating the broad adaptability of our methods.
Meanwhile, we establish a new performance lower bound for MGs, which tightens the existing results and verifies the nearly minimax optimality of the proposed algorithm.

1.1 RELATED WORK

Due to the space limit, we defer a comprehensive review of related work to Appendix A.2 and focus here on the works most related to our problem setup and algorithmic designs.

Offline RL with Linear Function Approximation. Jin et al. (2021c) and Zhong et al. (2022) provide the first results for offline linear MDPs and two-player zero-sum linear MGs, respectively. However, their algorithms are based on Least-Squares Value Iteration (LSVI) and establish pessimism by adding bonuses at every time step, thus suffering from a √d-amplification relative to the lower bound (Zanette et al., 2021; Zhong et al., 2022). The amplification results from the statistical dependency between different time steps. Subsequently, Min et al. (2021) studies offline policy evaluation (OPE) in linear MDPs under an additional assumption that the data samples at different time steps are independent, thus circumventing this core issue. Yin et al. (2022) studies policy optimization in linear MDPs, which also (implicitly) requires the independence assumption. Another line of work addresses the error amplification from temporal dependency with different algorithmic designs. Zanette et al. (2020) designs an actor-critic-based algorithm and establishes pessimism via direct perturbations of the parameter vectors in a linear function approximation scheme. Xie et al. (2021a); Uehara and Sun (2021) establish pessimism only at the initial state, but at the expense of computational tractability. These algorithmic ideas are fundamentally different from ours and do not apply to LSVI-type algorithms. We will compare the proposed algorithms with them in Section 5.
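For concreteness, the LSVI-type pessimism discussed above amounts to a ridge-regression backup with an elliptical bonus subtracted from the estimated Q-value. The following is a minimal illustrative sketch, not the paper's algorithm: the function name, the toy inputs, and treating the bonus multiplier beta as a free parameter are our own simplifications (in the theory, beta is set by the analysis).

```python
import numpy as np

def pessimistic_q(phi_data, targets, phi_query, beta, lam=1.0, vmax=1.0):
    """One LSVI-style backup with an elliptical pessimism bonus.

    phi_data:  (n, d) feature matrix of offline transitions at step h
    targets:   (n,) regression targets r + V_{h+1}(s')
    phi_query: (d,) feature of the (s, a) pair to evaluate
    beta:      bonus multiplier (set by theory; a free input here)
    """
    n, d = phi_data.shape
    Lam = phi_data.T @ phi_data + lam * np.eye(d)   # regularized Gram matrix
    w = np.linalg.solve(Lam, phi_data.T @ targets)  # ridge-regression weights
    # Elliptical bonus: beta * ||phi||_{Lam^{-1}}
    bonus = beta * np.sqrt(phi_query @ np.linalg.solve(Lam, phi_query))
    q = phi_query @ w - bonus                       # subtract bonus: pessimism
    return float(np.clip(q, 0.0, vmax))             # clip to the valid range
```

Subtracting the bonus at every step h is what makes the estimate pessimistic, and it is precisely this per-step bonus that accumulates into the √d-amplification the paper seeks to remove.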

2. OFFLINE LINEAR MARKOV DECISION PROCESS

Notations. Given a positive semi-definite matrix Λ and a vector u, we denote √(u⊤Λu) as ∥u∥_Λ. The 2-norm of a vector w is ∥w∥_2. We also denote by λ_min(A) the smallest eigenvalue of a matrix A. The subscript in f(x)_[0,M] means that we clip the value f(x) to the range [0, M], i.e., f(x)_[0,M] = min{max{f(x), 0}, M}.
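These two notations can be computed directly; the sketch below is only an illustration of the definitions (function names are ours):

```python
import numpy as np

def weighted_norm(u, Lam):
    """||u||_Lam = sqrt(u^T Lam u) for a positive semi-definite Lam."""
    return float(np.sqrt(u @ Lam @ u))

def clip_to_range(x, M):
    """f(x)_[0,M]: clip a scalar value to the range [0, M]."""
    return min(max(x, 0.0), M)

# Example: with Lam = 2*I and u = (3, 4), ||u||_Lam = sqrt(2 * 25) = 5*sqrt(2).
u = np.array([3.0, 4.0])
Lam = 2.0 * np.eye(2)
```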

