NEARLY MINIMAX OPTIMAL OFFLINE REINFORCEMENT LEARNING WITH LINEAR FUNCTION APPROXIMATION: SINGLE-AGENT MDP AND MARKOV GAME

Abstract

Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, minimax optimality has only been (nearly) established for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose a new pessimism-based algorithm for offline linear MDPs. At the core of our algorithm is an uncertainty decomposition via a reference function, which is new to the literature on offline RL under linear function approximation. Theoretical analysis demonstrates that our algorithm matches the performance lower bound up to logarithmic factors. We also extend our techniques to two-player zero-sum Markov games (MGs) and establish a new performance lower bound for MGs, which tightens the existing result and verifies the nearly minimax optimality of the proposed algorithm. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.

1. INTRODUCTION

Reinforcement learning (RL) has achieved tremendous empirical success in both single-agent (Kober et al., 2013) and multi-agent scenarios (Silver et al., 2016; 2017). Two components play a critical role: function approximation and efficient simulators. For RL problems with a large (or even infinite) number of states, storing a table as in classical Q-learning is generally infeasible. In these cases, practical algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2015; 2017; Haarnoja et al., 2018) approximate the true value function or policy by a function class (e.g., neural networks). Meanwhile, an efficient simulator allows for learning a policy in an online trial-and-error fashion using millions or even billions of trajectories. However, due to the limited availability of data samples in many practical applications, e.g., healthcare (Wang et al., 2018) and autonomous driving (Pan et al., 2017), instead of collecting new trajectories, we may have to extrapolate knowledge only from past experiences, i.e., a pre-collected dataset. This type of RL problem is usually referred to as offline RL or batch RL (Lange et al., 2012; Levine et al., 2020).

An offline RL algorithm is usually measured by its sample complexity to achieve the desired statistical accuracy. A line of works (Xie et al., 2021b; Shi et al., 2022; Li et al., 2022) demonstrates that near-optimal sample complexity in tabular single-agent MDPs is attainable. However, these algorithms cannot solve the problem with large or infinite state spaces where function approximation is involved. To the best of our knowledge, existing algorithms cannot attain the statistical limit even for linear function

*The first two authors contribute equally.

